Abstract

We  consider  a  receding  horizon  approach  as  an  approximate  solution  to  two-person  zero-sum 
Markov  games  with  infinite  horizon  discounted  cost  and  average  cost  criteria.  We  first  present 
error  bounds  from  the  optimal  equilibrium  value  of  the  game  when  both  players  take  correlated 
equilibrium  receding  horizon  policies  that  are  based  on  exact  or  approximate  solutions  of  receding 
finite  horizon  subgames.  Motivated  by  the  worst-case  optimal  control  of  queueing  systems  by 
Altman  [1] ,  we  then  analyze  error  bounds  when  the  minimizer  plays  the  (approximate)  receding 
horizon  control  and  the  maximizer  plays  the  worst  case  policy.  We  give  three  heuristic  examples 
of  the  approximate  receding  horizon  control.  We  extend  “rollout”  by  Bertsekas  and  Castanon  [9] 
and  “parallel  rollout”  and  “hindsight  optimization”  by  Chang  et  al.  [13,  16]  into  the  Markov 
game  setting  within  the  framework  of  the  approximate  receding  horizon  approach  and  analyze 
their  performances.  From  the  rollout/parallel  rollout  approaches,  the  minimizing  player  seeks 
to  improve  the  performance  of  a  single  heuristic  policy  it  rolls  out  or  to  combine  dynamically 
multiple  heuristic  policies  in  a  set  to  improve  the  performances  of  all  of  the  heuristic  policies 
simultaneously  under  the  guess  that  the  maximizing  player  has  chosen  a  fixed  worst-case  policy. 

Given $\epsilon > 0$, we give the value of the receding horizon which guarantees that the parallel rollout
policy with that horizon, played by the minimizer, dominates any heuristic policy in the set within
$\epsilon$. From the hindsight optimization approach, the minimizing player makes a decision based on
his  expected  optimal  hindsight  performance  over  a  finite  horizon.  We  finally  discuss  practical 
implementations  of  the  receding  horizon  approaches  via  simulation. 

Keywords: Markov game, receding horizon control, infinite horizon cost, rollout, hindsight optimization

1 Introduction

Game  Theory  has  been  used  to  model  dynamic  sequential  decision  making  problems  in  a  wide 
variety  of  situations  where  multiple  decision  makers  compete  or  cooperate  to  optimize  their  cost 
functionals.  In  this  paper,  we  consider  games  with  two  players  where  one  player  (the  minimizer) 
wishes  to  minimize  his  cost  that  will  be  paid  to  the  other  player  (the  maximizer).  Both  players 
take  underlying  decisions  simultaneously  at  each  state,  with  the  complete  knowledge  of  the  state 
of  the  system  but  without  knowing  each  other’s  current  action  being  taken.  We  can  view  the 
maximizer  as  nature  which  controls  the  disturbances  that  are  unknown  to  the  minimizer  [39].  The 
minimizer  then  tries  to  get  the  best  performance  under  the  worst  possible  dynamic  choice  of  the 
unknown  disturbance  parameters  controlled  by  nature.  That  is,  the  minimizer  seeks  to  design  a 
robust  controller  that  works  well  under  the  worst  case  scenario  [6].  This  gives  rise  to  two-person 
zero-sum  Markov  games. 

Recently, Markov games have received attention in the queueing system literature in order
to  solve  interesting  telecommunication  network  problems,  for  example,  admission  control,  routing, 
flow  control,  etc.  (see,  e.g.,  Altman’s  paper  [1]  and  the  references  therein  and  [24]).  However,  even 
though the worst-case scenario¹ that will be played by the maximizer can be analyzed for some
problems,  it  is  often  quite  difficult  to  obtain  such  a  policy  exactly.  In  that  case,  the  natural  step 
for  the  minimizer  is  to  “guess”  the  seemingly  worst  possible  play  of  the  maximizer  and  then  try  to 
optimize  his  performance.  If  the  minimizer  assumes  that  the  maximizer  will  play  a  fixed  policy,  to 
the  minimizer  the  problem  becomes  solving  a  Markov  decision  process  (MDP)  [42].  It  is  well-known 
that  solving  MDPs  in  general  (for  infinite  horizon  cost)  is  often  impractical  if  the  state  space  is 
large  even  though  exact  solution  techniques  are  available,  e.g.,  value  iteration  or  policy  iteration, 
etc.  (see,  e.g.,  [42]  for  a  substantial  discussion).  This  means  that  even  with  the  minimizer’s  guess 
on  the  opponent’s  play,  getting  the  best  performance  for  him  is  difficult. 

With  this  motivation,  we  focus  on  solving  two-person  zero-sum  Markov  games  with  infinite 
horizon  discounted  cost  and  infinite  horizon  average  cost  criteria  via  an  approximation  framework 
in  the  context  of  “planning”.  We  adopt  a  receding  horizon  control  approach.  The  idea  is  to  obtain 
an  optimal  solution  with  respect  to  a  “small”  moving  horizon  at  each  decision  time  and  apply  the 
solution  to  the  system.  In  fact,  this  approach  has  been  studied  in  several  contexts  in  various  fields, 
e.g.,  planning  in  economics  [27],  model  predictive  control  literature  [29,  34,  35],  and  planning  in 
MDPs  [23,  13],  etc.  In  the  game  setting,  Baglietto  et  al.  [5]  applied  team  theory  [25]  empirically  with 
a  receding  horizon  control  to  solve  a  routing  problem  in  a  communication  network  by  formulating 
the problem as a nonlinear optimal control problem, and Van den Broek [12] considered a receding
horizon control in non-zero-sum differential games, specifically analyzing the performance of
linear quadratic games. The receding horizon control he employed is somewhat different from what
we do here. In his case, at any decision time, the players base their actions on a finite horizon, but
the horizon size increases at each decision time. This paper focuses on a fixed receding horizon
size.

¹ By the worst-case scenario we mean the case in which the maximizer plays his optimal equilibrium policy, which we define later.

At  each  state,  the  minimizing  player  selects  a  small  but  typical  horizon  and  solves  the  given 
Markov  game  with  the  finite  horizon  (called  the  subgame)  under  the  guess  that  the  maximizing 
player  makes  his  decision  based  on  his  best  performance  for  the  subgame.  The  minimizing  player 
then  takes  a  randomized  action  based  on  the  solution  to  the  subgame.  The  intuition  is  that  if  the 
horizon  is  “long”  enough  to  get  a  stationary  behavior  of  the  game,  this  moving  horizon  control 
would  have  a  good  performance.  Indeed,  we  first  show  that  the  value  of  the  game  played  by  the 
receding  horizon  control  from  both  players  converges  geometrically  fast,  with  given  discount  factor 
in  (0,1)  for  infinite  horizon  discounted  cost  and  with  given  “ergodicity  coefficient”  in  (0,1)  for  infinite 
horizon  average  cost,  to  the  optimal  equilibrium  value  of  the  game,  uniformly  in  the  initial  state,  as 
the value of the moving horizon increases (Hernandez-Lerma and Lasserre [23] obtained a similar
result for MDPs). We mention here that this error analysis assumes that the maximizing
player also plays his respective half of a common copy of the approach, resulting in a so-called
correlated equilibrium. In other words, the maximizer also plays the receding horizon control like
the minimizer. We then present an error bound between the optimal equilibrium value and the
value of the game in which the minimizer plays the receding horizon control and the maximizer
plays  the  worst  case  scenario  (playing  the  equilibrium  policy),  which  also  vanishes  to  zero  as  the 
size  of  the  receding  horizon  goes  to  infinity.  This  also  answers  an  important  question  that  arises  in 
the  Markov  game  literature:  what  size  of  the  planning  horizon  should  the  minimizer  use  to  achieve 
a  good  approximate  value  of  the  equilibrium  value? 

However,  as  we  mentioned  before,  solving  the  finite  horizon  Markov  game  or  subgame  is  also 
troublesome  if  the  state  space  is  large.  So  we  consider  an  approximate  receding  horizon  control. 
Rather  than  solving  the  finite  horizon  subgames  exactly  at  each  decision  time,  at  each  state,  the 
minimizer  will  make  his  decision  based  on  the  approximate  solution  for  the  subgame.  We  also 
analyze  the  performance  of  this  approach  as  previously  done  for  the  receding  horizon  control  in 
MDP  contexts  [15]. 

We  then  shift  our  attention  to  some  examples  of  the  approximate  receding  horizon  control 
for  the  minimizer.  These  are  all  heuristics  where  the  minimizer  guesses  the  maximizer’s  worst 
case  scenario  and  approximates  the  solution  of  the  subgame  and  takes  a  decision  based  on  this 
approximate  solution,  and  all  of  these  heuristics  can  be  implemented  via  a  simple  Monte-Carlo 
simulation.  The  first  two  approaches  that  will  be  taken  by  the  minimizer  aim  at  improving  a 
given  heuristic  policy  (or  a  set  of  multiple  heuristic  policies)  that  is  available  to  the  minimizer, 
based  on  policy  improvement  arguments.  We  first  consider  an  adaptation  of  the  rollout  approach 
by Bertsekas and Castanon [9] into the Markov game setting. In this approach, the minimizer
guesses the maximizer’s worst possible play and rolls out a single heuristic policy to generate
a new policy whose performance in terms of the value of the game will be no worse than that of the given
heuristic policy if the maximizer indeed plays the policy the minimizer guessed. The minimizer
will often have more than one heuristic policy available, each of which performs
near-optimally on particular sample paths of the system. He may
well try to combine these policies dynamically. The next approach, called parallel rollout [13], is a
generalization of the rollout approach. It also appeals to the policy improvement principle, and one can
show that for any fixed policy taken by the maximizer, the minimizer will improve the performances
of  all  heuristic  policies  simultaneously  if  the  minimizer  plays  the  parallel  rollout  with  respect  to  the 
fixed  policy  of  the  maximizer.  In  other  words,  the  parallel  rollout  is  a  formal  method  of  generating 
a  policy  that  dominates  all  heuristic  policies  available.  Based  on  the  analysis  of  the  approximate 
receding  horizon  control  we  will  present  in  this  paper,  we  can  say  that  if  the  minimizer’s  guess  on 
the  opponent’s  play  is  good  and  the  resulting  approximate  value  of  the  subgame  is  also  good,  the 
two approaches will yield a reasonable performance to the minimizer. Furthermore, given $\epsilon > 0$, we
provide the value of the receding horizon which guarantees that, for any fixed policy played by the
maximizer, the parallel rollout policy with the finite horizon played by the minimizer with respect
to that fixed policy yields a value of the game no larger than the value of the game in which the
minimizer plays any one of the heuristic policies against the same fixed policy of
the maximizer, plus $\epsilon$.

The  final  example  approach  is  motivated  by  hindsight  optimization  proposed  in  [13,  16].  By  this 
approach,  at  each  state,  the  minimizer  evaluates  his  candidate  randomized  actions  based  on  the 
analysis  of  the  expected  optimal  hindsight  performance  over  a  finite  horizon  under  the  assumption 
that the maximizer plays the worst-case fixed policy that is chosen by the minimizer. This approach
has  a  flavor  of  heuristically  adapting  the  hindsight  optimal  solutions  into  on-line  solutions  (via 
on-line  simulation). 

This  paper  is  organized  as  follows.  In  Section  2,  we  formalize  mathematically  the  Markov 
games  we  consider.  We  then  introduce  the  (approximate)  receding  horizon  control  in  Section  3 
and  analyze  performances.  We  then  discuss  three  heuristics  for  the  approximate  receding  horizon 
control  in  Section  4.  In  Section  5,  we  discuss  implementation  issues  and  other  research  directions. 

2  Markov  Game 

In  this  section,  we  formulate  the  two-person  zero-sum  Markov  game  introduced  by  Shapley  [44]  in 
a formal mathematical setting. For a substantial discussion on this topic, see, e.g., [18], [7], or [40].
Let $X$ denote a finite state space and, for $x \in X$, let $N(x)$ and $M(x)$ denote the finite sets of actions
for the minimizing player (minimizer) and the maximizing player (maximizer), respectively. Both
players play underlying actions simultaneously at each state, with the complete knowledge of the
state of the system but without knowing each other’s current action being taken. At each state
$x$, each player will consider choosing an action to take according to a probability distribution over
the available actions. For each $x \in X$, we define the players’ “admissible randomized action sets”
as $G(x)$ and $F(x)$ such that

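\[ G(x) = \Big\{ g \in \mathbb{R}^{|N(x)|} : \sum_{i \in N(x)} g_i = 1, \text{ and } \forall i,\ g_i \ge 0 \Big\}, \]

\[ F(x) = \Big\{ f \in \mathbb{R}^{|M(x)|} : \sum_{i \in M(x)} f_i = 1, \text{ and } \forall i,\ f_i \ge 0 \Big\}. \]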

Once the actions $n \in N(x)$ and $m \in M(x)$ at state $x$ are taken by both players, the state transitions
probabilistically to the next state $y$ according to the probability $p(y|x,n,m)$. From this, we induce the
probability $P_{xy}(g,f)$ of transitioning from state $x$ to state $y$ under the
randomized actions $g \in G(x)$ and $f \in F(x)$:

\[ P_{xy}(g,f) = \sum_{n \in N(x)} \sum_{m \in M(x)} g_n f_m \, p(y|x,n,m). \]

If the minimizer takes a randomized action $g \in G(x)$ and the maximizer takes $f \in F(x)$ at state $x$,
then the minimizer gets the expected payoff (cost) $C_x(g,f)$, which is given by

\[ C_x(g,f) = \sum_{y \in X} \sum_{n \in N(x)} \sum_{m \in M(x)} c(x,y,n,m)\, p(y|x,n,m)\, g_n f_m, \]

where $c(x,y,n,m)$ is the immediate payoff to the minimizer (the negative of this will be incurred
to the maximizer) associated with the current state and next state pair $(x,y)$ after the minimizer takes
action $n \in N(x)$ and the maximizer takes action $m \in M(x)$. We assume that $|C_x(g,f)| \le C_{\max} < \infty$ for
any $x$, $g$, and $f$. We now define a stationary policy (or strategy) $\pi$ of the minimizer to be a function
$\pi : X \to G(X)$ and denote by $\Pi$ the set of all possible policies; a policy $\phi$ and the set
$\Phi$ are defined similarly for the maximizer. We will say that a stationary policy is pure if the randomized
action selected by the policy at every state yields a non-randomized action choice, i.e., an action is
selected with probability one.
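
As an illustration of the two definitions above, the following minimal Python sketch computes the induced transition row $P_{xy}(g,f)$ and the expected cost $C_x(g,f)$ for a small hypothetical game. The nested-list layout `p[x][n][m][y]` and `c[x][n][m][y]`, the function names, and the toy numbers are assumptions made for this example only; they are not part of the formal model.

```python
# Hypothetical layout: p[x][n][m][y] = p(y | x, n, m) and c[x][n][m][y] = c(x, y, n, m).

def induced_transition(p, x, g, f):
    """Return the row [P_xy(g, f) for every next state y] at state x."""
    num_states = len(p)
    return [
        sum(g[n] * f[m] * p[x][n][m][y]
            for n in range(len(g)) for m in range(len(f)))
        for y in range(num_states)
    ]


def expected_cost(p, c, x, g, f):
    """Return C_x(g, f), the expected one-step cost paid by the minimizer."""
    num_states = len(p)
    return sum(
        c[x][n][m][y] * p[x][n][m][y] * g[n] * f[m]
        for y in range(num_states)
        for n in range(len(g))
        for m in range(len(f))
    )


# Toy game (illustrative numbers): two states, two actions per player in each state.
p = [
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]],  # state 0
    [[[0.3, 0.7], [0.6, 0.4]], [[0.8, 0.2], [0.4, 0.6]]],  # state 1
]
# The cost is 1 when both players pick the same action index and 0 otherwise.
c = [[[[1.0 if n == m else 0.0 for _ in range(2)]
       for m in range(2)] for n in range(2)] for _ in range(2)]

g = [0.5, 0.5]  # minimizer mixes uniformly
f = [1.0, 0.0]  # maximizer plays its first action
row = induced_transition(p, 0, g, f)
cost = expected_cost(p, c, 0, g, f)
```

Because $g$ and $f$ are probability vectors and each $p(\cdot|x,n,m)$ is a distribution, the returned row is itself a probability distribution over next states.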

In this paper, we consider two objective function criteria: infinite horizon discounted cost
and average cost. Given a policy $\pi$ selected by the minimizer and a policy $\phi$ selected by the
maximizer, we define the value of the game played with $\pi$ and $\phi$ by the minimizer and the maximizer,
respectively, with a starting state $x$ as

\[ V_\infty(\pi,\phi)(x) := E\Big\{ \sum_{t=0}^{\infty} \gamma^t C_{x_t}(\pi(x_t), \phi(x_t)) \,\Big|\, x_0 = x \Big\} \]

for the infinite horizon discounted cost criterion, where $x_t$ is a random variable denoting the state
at time $t$ following the policies $\pi$ and $\phi$, and $\gamma \in (0,1)$ is a given discount factor. The discount
factor can be interpreted as the probability that the game will be allowed to continue after the
current decisions made by both players. Similarly, we define the value of the game for the infinite
horizon average cost criterion as

\[ J_\infty(\pi,\phi)(x) := \lim_{H \to \infty} \frac{1}{H} E\Big\{ \sum_{t=0}^{H-1} C_{x_t}(\pi(x_t), \phi(x_t)) \,\Big|\, x_0 = x \Big\} \]

with given policies $\pi$ and $\phi$.

The goal of the minimizer (the maximizer) is to find a policy $\pi \in \Pi$ ($\phi \in \Phi$) which minimizes
(maximizes) the value of the game. Throughout this paper, $V_\infty$ always refers to the value of the
game with the infinite horizon discounted cost criterion and $J_\infty$ refers to the value of the game
with the infinite horizon average cost criterion, so that we will omit which criterion we mention at
any point if the context is clear.

2.1  Some  preliminaries 

2.1.1  Infinite  horizon  discounted  cost 

It is well-known (see, e.g., [40]) that there exists an optimal equilibrium policy pair $\pi^* \in \Pi$ and
$\phi^* \in \Phi$ such that for all $\pi \in \Pi$, $\phi \in \Phi$, and $x \in X$,

\[ V_\infty(\pi^*,\phi)(x) \le V_\infty(\pi^*,\phi^*)(x) \le V_\infty(\pi,\phi^*)(x). \]

We will refer to the value $V_\infty(\pi^*,\phi^*)(x)$ as the equilibrium value of the game associated with state
$x$ and to $\pi^*$ and $\phi^*$ as the equilibrium policies for the minimizer and the maximizer, respectively.
We will write $V_\infty(\pi^*,\phi^*)$ as $V_\infty^*$ and focus on finding or approximating the policy $\pi^*$ (note that
the content of this paper can be interpreted for the maximizer case by exchanging the roles of the
minimizer and the maximizer). A primitive but important notion that arises in game theory is that
of dominance (see, e.g., [19]). We will say that a policy $\pi_1 \in \Pi$ (weakly) dominates $\pi_2 \in \Pi$ if and
only if for any $\phi \in \Phi$, $V_\infty(\pi_1,\phi) \le V_\infty(\pi_2,\phi)$.

Now let $B(X)$ be the space of real-valued bounded measurable functions on $X$ endowed with
the supremum norm $\|V\| = \sup_x |V(x)|$ for $V \in B(X)$. We define several operators that map a
function in $B(X)$ to a function in $B(X)$: for all $\pi \in \Pi$, $\phi \in \Phi$, $V \in B(X)$, and $x \in X$,


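\[ T(V)(x) = \inf_{g \in G(x)} \sup_{f \in F(x)} \Big\{ C_x(g,f) + \gamma \sum_{y \in X} P_{xy}(g,f) V(y) \Big\}, \]

\[ T_{\pi,\phi}(V)(x) = C_x(\pi(x),\phi(x)) + \gamma \sum_{y \in X} P_{xy}(\pi(x),\phi(x)) V(y), \]

\[ T_{\phi}(V)(x) = \inf_{g \in G(x)} \Big\{ C_x(g,\phi(x)) + \gamma \sum_{y \in X} P_{xy}(g,\phi(x)) V(y) \Big\}, \]

\[ T_{\pi}(V)(x) = \sup_{f \in F(x)} \Big\{ C_x(\pi(x),f) + \gamma \sum_{y \in X} P_{xy}(\pi(x),f) V(y) \Big\}. \]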


It is well-known [18, 40] that each of the above operators is a contraction mapping in $B(X)$,
that is, for the case of $T$, for any $V_1$ and $V_2$ in $B(X)$, $\|T(V_1) - T(V_2)\| \le \gamma \|V_1 - V_2\|$, and that each
operator has a monotonicity property, that is, for the case of $T$, if $V_1(x) \le V_2(x)$ for all $x \in X$, then
$T(V_1)(x) \le T(V_2)(x)$ for all $x \in X$. Furthermore, there exists a unique fixed point $v \in B(X)$ such that
$T(v) = v$, and $v$ is equal to $V_\infty(\pi^*,\phi^*)$; likewise, there exists a unique fixed point $u \in B(X)$ such that
$T_{\pi,\phi}(u) = u$, and $u = V_\infty(\pi,\phi)$. We finally remark that for all $x \in X$, the infimum and supremum in the
definitions of the operators $T$, $T_\phi$, and $T_\pi$ are achieved by elements in $G(x)$ and $F(x)$ (see, e.g.,
Section 3 in [38] or [40]).
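
The contraction property suggests a simple way to evaluate a fixed stationary policy pair numerically: apply $T_{\pi,\phi}$ repeatedly until the iterates stop changing. The sketch below assumes that the pair $(\pi,\phi)$ has already been reduced to a cost vector `C[x]` $= C_x(\pi(x),\phi(x))$ and a transition matrix `P[x][y]` $= P_{xy}(\pi(x),\phi(x))$; the function names and the numbers are illustrative assumptions.

```python
def apply_T_pair(V, C, P, gamma):
    """One application of T_{pi,phi}: V -> C + gamma * P V."""
    n = len(V)
    return [C[x] + gamma * sum(P[x][y] * V[y] for y in range(n)) for x in range(n)]


def sup_norm_distance(U, V):
    """Supremum-norm distance between two functions on a finite state space."""
    return max(abs(u - v) for u, v in zip(U, V))


def evaluate_pair(C, P, gamma, tol=1e-10, max_iter=10000):
    """Approximate the fixed point u = V_inf(pi, phi) by successive approximation."""
    V = [0.0] * len(C)
    for _ in range(max_iter):
        V_next = apply_T_pair(V, C, P, gamma)
        if sup_norm_distance(V_next, V) < tol:
            return V_next
        V = V_next
    return V


# Illustrative two-state example with |C_x| <= C_max = 1.
C = [1.0, -0.5]
P = [[0.7, 0.3], [0.4, 0.6]]
gamma = 0.9
V_fixed = evaluate_pair(C, P, gamma)

# Contraction check on two arbitrary functions.
V1, V2 = [0.0, 0.0], [10.0, -3.0]
lhs = sup_norm_distance(apply_T_pair(V1, C, P, gamma), apply_T_pair(V2, C, P, gamma))
rhs = gamma * sup_norm_distance(V1, V2)
```

In this example `lhs` does not exceed `rhs`, as the contraction inequality requires, and the computed fixed point stays within $C_{\max}/(1-\gamma)$ in absolute value.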

Let $\{V_n^*\}$ be the sequence of value iteration functions $V_n^* := T(V_{n-1}^*)$, where $n = 1, 2, \ldots$ and
$V_0^*$ is an arbitrary function in $B(X)$, except that we assume $\max_x |V_0^*(x)| \le C_{\max}/(1-\gamma)$ for
a technical reason. It is straightforward to show that as $n \to \infty$, $V_n^*$ converges to $V_\infty(\pi^*,\phi^*)$
geometrically fast in $\gamma$ by the contraction mapping property and the Banach fixed point theorem.
Furthermore, $V_n^*$ is the equilibrium value of the finite $n$-horizon game. We introduce a nonstationary
or time-dependent policy for the minimizer, $\bar\pi = \{\pi_0, \pi_1, \ldots\}$ where $\pi_t \in \Pi$, denote the set of
all possible nonstationary policies by $\bar\Pi$, and similarly define $\bar\phi$ and $\bar\Phi$ for the maximizer. Then $V_n^*(x)$ is the
value of the game when, starting in state $x$, both players play their own equilibrium nonstationary
policies for the $n$-horizon game with the terminal cost $V_0^*$ (see, e.g., [1, 46]), and is given by

\[ V_n^*(x) = \inf_{\bar\pi \in \bar\Pi} \sup_{\bar\phi \in \bar\Phi} E\Big\{ \sum_{t=0}^{n-1} \gamma^t C_{x_t}(\pi_t(x_t), \phi_t(x_t)) + \gamma^n V_0^*(x_n) \,\Big|\, x_0 = x \Big\} \]

for $x \in X$.

2.1.2  Infinite  horizon  average  cost 

Unlike the discounted cost case, an equilibrium value does not always exist for
average cost Markov games in general [21]. We make the following assumption:

Assumption 2.1 The Markov chain associated with each pair of pure policies is irreducible,
and there exists $\rho > 0$ such that for any $\pi \in \Pi$, $\phi \in \Phi$, and $x \in X$,

\[ P_{xx}(\pi(x),\phi(x)) \ge \rho. \]

The first assumption implies that the underlying Markov chain is a recurrent unichain and the
second assumption is the strong aperiodicity² condition.

² This aperiodicity assumption is not a serious assumption (see, e.g., page 231 in [46]).

Under Assumption 2.1, there exists an optimal equilibrium policy pair $\pi^* \in \Pi$ and $\phi^* \in \Phi$ such
that for all $\pi \in \Pi$, $\phi \in \Phi$, and $x \in X$,

\[ J_\infty(\pi^*,\phi)(x) \le J_\infty(\pi^*,\phi^*)(x) \le J_\infty(\pi,\phi^*)(x), \]

and in fact, each term in the above inequalities is independent of the initial state $x$, so that we can
omit $x$ in each term above [46]. Furthermore, $\pi^*$ ($\phi^*$) here will be a different policy in general from
the equilibrium policy for the discounted cost case. We will abuse the notation for our convenience,
and what we refer to will be clear from our presentation. We will refer to the value $J_\infty(\pi^*,\phi^*)$ as
the equilibrium value of the game and to $\pi^*$ and $\phi^*$ as the equilibrium policies for the minimizer
and the maximizer, respectively, similar to the discounted case. We will write $J_\infty(\pi^*,\phi^*)$ as $J_\infty^*$
and focus on finding or approximating the policy $\pi^*$. The dominance notion is also similarly
defined: we will say that a policy $\pi_1 \in \Pi$ (weakly) dominates $\pi_2 \in \Pi$ if and only if for any $\phi \in \Phi$,
$J_\infty(\pi_1,\phi) \le J_\infty(\pi_2,\phi)$.

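We define several operators that map a function in $B(X)$ to a function in $B(X)$: for all $\pi \in \Pi$,
$\phi \in \Phi$, $V \in B(X)$, and $x \in X$, the operators $T$, $T_{\pi,\phi}$, $T_\phi$, and $T_\pi$ are defined as in the
discounted case but without the discount factor; for example,

\[ T(V)(x) = \inf_{g \in G(x)} \sup_{f \in F(x)} \Big\{ C_x(g,f) + \sum_{y \in X} P_{xy}(g,f) V(y) \Big\}. \]

It is well-known (see, e.g., [18, 40, 46]) that each of the above operators has a monotonicity
property and that the infimum and supremum in the definitions of the operators $T$, $T_\phi$, and $T_\pi$ are
achieved by elements in $G(x)$ and $F(x)$.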

Let $\{V_n^*\}$ be the sequence of value iteration functions with respect to $T$, $V_n^* := T(V_{n-1}^*)$, where
$n = 1, 2, \ldots$ and $V_0^*$ is an arbitrary function in $B(X)$. It has been shown [46] (under Assumption 2.1
on average cost Markov games) that as $n \to \infty$, $V_n^* - nJ_\infty^*$ converges to a function $V_\infty^* \in B(X)$ that satisfies

\[ T(V_\infty^*)(x) = J_\infty^* + V_\infty^*(x) \quad \text{for all } x \in X. \]

Furthermore, $V_n^*$ is the equilibrium value of the finite $n$-horizon game without discount. In this
paper, we will assume that $V_0^*(x) = 0$ for all $x \in X$.

3 Receding Horizon Control

3.1  Infinite  horizon  discounted  cost 

As we mentioned before, solving Markov games with large state spaces for infinite horizon costs is often
impractical. Therefore, we adopt a finite-horizon approximation scheme for the infinite horizon
problem.  We  select  a  small  but  typical  horizon  and  solve  for  the  Markov  game  with  the  finite 
horizon  (in  our  case,  we  are  interested  in  only  the  optimal  current  or  initial  randomized  actions  for 
the  minimizer  and  the  maximizer).  That  is,  we  solve  the  Markov  game  with  the  total  discounted 
cost  criterion  at  each  decision  time.  The  intuition  is  that  if  the  fixed  horizon  is  “long”  enough  to  get 
a  stationary  behavior  of  the  system,  this  moving  horizon  control  would  have  a  good  performance. 
Indeed,  we  show  that  the  value  of  the  game  of  the  receding  horizon  control  converges  geometrically 
to  the  equilibrium  value,  uniformly  in  the  initial  state,  as  the  value  of  the  moving  horizon  increases. 

The receding horizon control is defined as follows. Given a finite horizon $H \ge 1$, we
define the receding $H$-horizon control as a policy $\pi_H^* \in \Pi$ for the minimizer and a policy $\phi_H^* \in \Phi$
for the maximizer such that $T_{\pi_H^*,\phi_H^*}(V_{H-1}^*)(x) = T(V_{H-1}^*)(x)$ for all $x \in X$. Note that the receding
$H$-horizon control policy is a stationary policy. We have the following bound on the performance
error.
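
To make the definition concrete, the following sketch computes the minimizer's receding $H$-horizon randomized action for games in which each player has exactly two actions in every state, so that each stage game is a 2x2 matrix game with a closed-form solution. This restriction, the data layout `p[x][n][m][y]` and `c[x][n][m][y]`, the toy numbers, and all function names are assumptions of the example; games with larger action sets would need a linear programming solver for the stage games.

```python
def solve_2x2(A, tol=1e-12):
    """Solve a 2x2 zero-sum matrix game; A[n][m] is the cost the row player
    (minimizer) pays to the column player (maximizer).
    Returns (value, g), where g is the minimizer's equilibrium mixed action."""
    minimax = min(max(A[0]), max(A[1]))  # min over rows of the row maximum
    maximin = max(min(A[0][0], A[1][0]), min(A[0][1], A[1][1]))  # max over columns of the column minimum
    if minimax - maximin <= tol:
        # Pure saddle point: the minimizer commits to the row with the smaller maximum.
        return minimax, ([1.0, 0.0] if max(A[0]) <= max(A[1]) else [0.0, 1.0])
    # No saddle point: the minimizer mixes so that both columns give the same cost.
    denom = A[0][0] + A[1][1] - A[0][1] - A[1][0]
    g0 = (A[1][1] - A[1][0]) / denom
    value = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) / denom
    return value, [g0, 1.0 - g0]


def stage_matrix(p, c, x, V, gamma):
    """Q[n][m] = sum_y p(y|x,n,m) * (c(x,y,n,m) + gamma * V(y))."""
    return [[sum(p[x][n][m][y] * (c[x][n][m][y] + gamma * V[y]) for y in range(len(V)))
             for m in range(2)] for n in range(2)]


def apply_T(p, c, V, gamma):
    """One step of value iteration: T(V)(x) is the value of the stage game at x."""
    return [solve_2x2(stage_matrix(p, c, x, V, gamma))[0] for x in range(len(V))]


def receding_horizon_action(p, c, x, H, gamma, V0):
    """Minimizer's randomized action at x under the receding H-horizon control."""
    V = list(V0)
    for _ in range(H - 1):  # compute V_{H-1}^* from V_0^*
        V = apply_T(p, c, V, gamma)
    return solve_2x2(stage_matrix(p, c, x, V, gamma))[1]


# Toy game (illustrative numbers): two states, two actions per player.
p = [
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]],
    [[[0.3, 0.7], [0.6, 0.4]], [[0.8, 0.2], [0.4, 0.6]]],
]
stage_cost = [[[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.6, 0.0]]]
c = [[[[stage_cost[x][n][m]] * 2 for m in range(2)] for n in range(2)] for x in range(2)]

gamma, c_max = 0.9, 1.0
V0 = [c_max / (1.0 - gamma)] * 2  # terminal cost bounded by C_max / (1 - gamma)
V1 = apply_T(p, c, V0, gamma)
V2 = apply_T(p, c, V1, gamma)
g = receding_horizon_action(p, c, 0, 5, gamma, V0)
```

Only the first-stage randomized action is kept; the same computation is repeated at every state the game visits, which is what makes the resulting policy stationary.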

Theorem 3.1 For all $x \in X$,

\[ |V_\infty^*(x) - V_\infty(\pi_H^*,\phi_H^*)(x)| \le \]


Proof: See the proof of Theorem 3.6 below with $\epsilon = 0$ and $n = H - 1$. ∎

We remark that the same result can be obtained alternatively from Lemma 4.3.5 on page 181
in [18] via a simple algebraic manipulation.

From the theorem above, we can see that the receding horizon control gives a good approximation
for the infinite horizon equilibrium policy for each player, and the value of the game using
these policies approaches the equilibrium performance for the infinite horizon cost geometrically
in $\gamma$. Furthermore, by setting the bound in the theorem equal to a given $\epsilon > 0$, we can obtain the
necessary value of the planning
horizon which guarantees that the performance of the receding horizon control will be within $\epsilon$ of
the equilibrium value.

The minimizer will play the game by the receding horizon control based on correlated equilibrium.
That is, he assumes that the maximizer also plays the common copy of the receding horizon
control. We need to analyze the error bound when the maximizer’s play is the true worst case
scenario, $\phi^*$. We begin with a lemma regarding the monotonicity property of the $T_{\pi,\phi}$-operator.

Lemma 3.1 For any $\pi \in \Pi$ and $\phi \in \Phi$, suppose there exists $\psi \in B(X)$ for which $T_{\pi,\phi}(\psi)(x) \le \psi(x)$
for all $x \in X$; then $V_\infty(\pi,\phi)(x) \le \psi(x)$ for all $x \in X$.

The above lemma can be easily proven by the monotonicity property of the operator $T_{\pi,\phi}$ and the
convergence to the unique fixed point $V_\infty(\pi,\phi)$ from successive applications of the operator. The
next lemma states that the function $V_n^*$ is non-increasing in $n$ under a suitable initial condition and
is a simplified version of Lemma 3.1 in [45] in our context. We provide the proof for completeness.

Lemma 3.2 Suppose $V_0^*$ is selected such that $T(V_0^*)(x) \le V_0^*(x)$ for all $x \in X$. Then, for $H =
1, 2, \ldots$, and for all $x \in X$, $V_H^*(x) \le V_{H-1}^*(x)$.

Proof: The proof is by induction on $H$. For $H = 1$, since $V_1^* = T(V_0^*)$, we have $V_1^*(x) \le V_0^*(x)$
for all $x \in X$ from the assumption.

Assuming that the assertion is true for $H = 1, \ldots, k$, we prove that it holds for $H = k + 1$. For
all $x \in X$,

\begin{align*}
V_{k+1}^*(x) &= T(V_k^*)(x) \\
&= T(T(V_{k-1}^*))(x) \\
&\le T(V_{k-1}^*)(x) \quad \text{from the monotonicity of } T \text{ and the induction hypothesis} \\
&= V_k^*(x),
\end{align*}

which proves the claim. ∎

We remark that one such $V_0^*$ can be simply given by $V_0^*(x) = C_{\max}/(1-\gamma)$ for all $x \in X$.

Theorem 3.2 Suppose $V_0^*$ is selected such that for all $x \in X$, $T(V_0^*)(x) \le V_0^*(x)$. Then, for all
$x \in X$,

\[ 0 \le V_\infty(\pi_H^*,\phi^*)(x) - V_\infty^*(x) \le \frac{\gamma^H}{1-\gamma} \cdot 2C_{\max}. \]


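Proof: The lower bound is trivially true, so we prove the upper bound. For all $x \in X$,

\begin{align*}
T_{\pi_H^*,\phi^*}(V_H^*)(x) &= C_x(\pi_H^*(x),\phi^*(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_H^*(x),\phi^*(x)) V_H^*(y) \\
&\le C_x(\pi_H^*(x),\phi^*(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_H^*(x),\phi^*(x)) V_{H-1}^*(y) \quad \text{by Lemma 3.2} \\
&\le \sup_{f \in F(x)} \Big\{ C_x(\pi_H^*(x),f) + \gamma \sum_{y \in X} P_{xy}(\pi_H^*(x),f) V_{H-1}^*(y) \Big\} \\
&= C_x(\pi_H^*(x),\phi_H^*(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_H^*(x),\phi_H^*(x)) V_{H-1}^*(y) \quad \text{by definition of } \phi_H^* \\
&= T_{\pi_H^*,\phi_H^*}(V_{H-1}^*)(x) = T(V_{H-1}^*)(x) = V_H^*(x).
\end{align*}

Therefore, by Lemma 3.1, $V_\infty(\pi_H^*,\phi^*)(x) \le V_H^*(x)$ for all $x \in X$. It follows that for all $x \in X$,

\[ V_\infty(\pi_H^*,\phi^*)(x) - V_\infty^*(x) \le V_H^*(x) - V_\infty^*(x). \]

Observe that $\max_x |V_n^*(x)| \le C_{\max}/(1-\gamma)$ for all $n \ge 0$ under the assumption on $V_0^*$. Therefore, for
$n = 0, 1, \ldots$,

\[ \max_x |V_\infty^*(x) - V_n^*(x)| \le \gamma^n \max_x |V_\infty^*(x) - V_0^*(x)| \le \frac{2C_{\max}}{1-\gamma} \cdot \gamma^n. \tag{1} \]

Combining the two inequalities, we have the desired result. ∎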

As expected, the error bound vanishes geometrically fast in the discount factor as the size of
the horizon increases to infinity.
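
The bound of Theorem 3.2 can be inverted to obtain a planning horizon that suffices for a prescribed accuracy. The sketch below returns the smallest integer $H \ge 1$ with $\gamma^H/(1-\gamma) \cdot 2C_{\max} \le \epsilon$; the function names are ours, and the final adjustment loops only guard against floating-point rounding in the logarithm.

```python
import math


def theorem_3_2_bound(H, gamma, c_max):
    """Right-hand side of the Theorem 3.2 bound: gamma^H / (1 - gamma) * 2 * C_max."""
    return gamma ** H / (1.0 - gamma) * 2.0 * c_max


def min_horizon(eps, gamma, c_max):
    """Smallest integer H >= 1 whose Theorem 3.2 bound is at most eps."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if eps <= 0.0 or c_max <= 0.0:
        raise ValueError("eps and c_max must be positive")
    target = eps * (1.0 - gamma) / (2.0 * c_max)  # we need gamma^H <= target
    if target >= 1.0:
        H = 1
    else:
        H = max(1, math.ceil(math.log(target) / math.log(gamma)))
    while theorem_3_2_bound(H, gamma, c_max) > eps:  # rounding guard (upward)
        H += 1
    while H > 1 and theorem_3_2_bound(H - 1, gamma, c_max) <= eps:  # rounding guard (downward)
        H -= 1
    return H


H_example = min_horizon(0.1, 0.9, 1.0)
```

The required horizon grows only logarithmically in $1/\epsilon$, but it grows quickly as $\gamma$ approaches one.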

Consider the following condition: there exists a function $\delta$ defined on $X$ such that
$0 < \sum_{x \in X} \delta(x) < 1$ and $P_{xy}(g,f) \ge \delta(y)$ for all $x, y, g, f$. It turns out that if the given Markov
game meets this condition, the error bounds in the above theorems can be improved by a factor
$(1 - \sum_{x \in X} \delta(x))^H$, as in the MDP case [23]. Let $\beta = 1 - \sum_{x \in X} \delta(x)$ and define two probability
distributions $P'$ and $\psi$ such that $\psi(y) = \delta(y)/(1-\beta)$ and

\[ P'_{xy}(g,f) = \frac{1}{\beta}\big[ P_{xy}(g,f) - (1-\beta)\psi(y) \big]. \]

Then, the transition probability $P$ can be expressed as

\[ P_{xy}(g,f) = \beta P'_{xy}(g,f) + (1-\beta)\psi(y). \]

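We further define the operator $T' : B(X) \to B(X)$ as in $T$ except that we use $P'$ instead of $P$
and the discount factor $\gamma\beta$ instead of $\gamma$.
Then, for any function $v \in B(X)$, the $T$ and $T'$ operators are related by

\[ T(v)(x) = T'(v)(x) + \gamma(1-\beta)\psi(v), \quad x \in X, \]

where $\psi(v) = \sum_{y \in X} \psi(y) v(y)$.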
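
The relation between $T$ and $T'$ can be verified in a few lines. The derivation below is a sketch under two assumptions that should be checked against the definitions above: that $\psi(v)$ denotes the expectation $\sum_{y \in X} \psi(y) v(y)$, and that $T'$ is formed with $P'$ and the discount factor $\gamma\beta$. The key step is that the term involving $\psi$ does not depend on $g$ or $f$ and can therefore be moved outside the infimum and supremum.

```latex
\begin{align*}
T(v)(x)
 &= \inf_{g \in G(x)} \sup_{f \in F(x)}
    \Big\{ C_x(g,f) + \gamma \sum_{y \in X} P_{xy}(g,f)\, v(y) \Big\} \\
 &= \inf_{g \in G(x)} \sup_{f \in F(x)}
    \Big\{ C_x(g,f) + \gamma \sum_{y \in X}
      \big[ \beta P'_{xy}(g,f) + (1-\beta)\psi(y) \big] v(y) \Big\} \\
 &= \inf_{g \in G(x)} \sup_{f \in F(x)}
    \Big\{ C_x(g,f) + \gamma\beta \sum_{y \in X} P'_{xy}(g,f)\, v(y) \Big\}
    + \gamma(1-\beta) \sum_{y \in X} \psi(y)\, v(y) \\
 &= T'(v)(x) + \gamma(1-\beta)\,\psi(v).
\end{align*}
```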

Let $\{V'_n\}$ be the sequence of value iteration functions with respect to $T'$, $V'_n := T'(V'_{n-1})$, where
$n = 1, 2, \ldots$, and set $V'_0(x) = V_0^*(x) = C_{\max}/(1-\gamma)$ for all $x \in X$. By induction, we can show that

\[ V_n^*(x) = V'_n(x) + c_n, \quad x \in X, \tag{2} \]

where $c_n$ is the constant given by

\[ c_n = \gamma(1-\beta) \sum_{k=0}^{n-1} (\gamma\beta)^k \psi(V_{n-1-k}^*) = \gamma(1-\beta) \sum_{k=0}^{\infty} (\gamma\beta)^k \psi(V_{n-1-k}^*), \tag{3} \]

setting $V_k^*$ to the zero function for $k < 0$ if $n \ge 1$, and $c_0 = 0$. From this, we can conclude that
(cf. Lemma 4.1 in [23])

\[ V_\infty^*(x) = V'^{*}_\infty(x) + C(\gamma,\beta)\psi(V_\infty^*), \quad x \in X, \]

where $V'^{*}_\infty = \lim_{n \to \infty} V'_n$ and $C(\gamma,\beta) = \gamma(1-\beta)/(1-\gamma\beta)$. Observe that $V'^{*}_\infty$ is the optimal
equilibrium value function for the underlying Markov game replaced with $P'$ and discount factor
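$\gamma\beta$. By the same arguments, we can show that for any $\pi \in \Pi$ and $\phi \in \Phi$,

\[ V_\infty(\pi,\phi)(x) = V'_\infty(\pi,\phi)(x) + C(\gamma,\beta)\psi(V_\infty(\pi,\phi)), \quad x \in X. \]

This immediately implies that

\[ V_\infty^*(x) - V_\infty(\pi_H^*,\phi_H^*)(x) = V'^{*}_\infty(x) - V'_\infty(\pi_H^*,\phi_H^*)(x) + C(\gamma,\beta)\psi(V_\infty^* - V_\infty(\pi_H^*,\phi_H^*)). \]

Observe that a policy pair $\pi \in \Pi$ and $\phi \in \Phi$ such that $T'_{\pi,\phi}(V'_{H-1})(x) = T'(V'_{H-1})(x)$ for all
$x \in X$ prescribes the same randomized action choice as $\pi_H^*$ and $\phi_H^*$, from the relationship given by
Equation (2). Now, by majorization of $V'^{*}_\infty(x) - V'_\infty(\pi_H^*,\phi_H^*)(x)$ from Theorem 3.1 with the
observation just made, it follows that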

max[l/^(x)-Foo(7r|^,(/>H)(a;)]  <  •2gmax+g(7, /3)  max[l/^(x)-Foo(7r|^,  </>h)(x)],  x  G  X. 

We can also minorize $V'_\infty(x) - V'_\infty(\pi^*_H,\phi^*_H)(x)$, from which we conclude that for all $x \in X$,

-  C»(,ri.ft){x)|  <  [1  -  C(7./3)]-‘  ■  ■  2C„„, 

The upper bound of Theorem 3.2 can also be improved by a factor of $\beta^H$ with the same arguments.
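The decomposition $P = \beta P' + (1-\beta)\psi$ can be checked numerically. The sketch below is illustrative only: it works on a single row-stochastic matrix (the text requires the condition for every action pair $f, g$), it takes $\delta(y) = \min_x P_{xy}$ and $\psi = \delta/(1-\beta)$ as assumptions, and the names `split_kernel` and `c_factor` are hypothetical.

```python
def split_kernel(P):
    """Split a row-stochastic matrix as P = beta * P_prime + (1 - beta) * psi."""
    n = len(P)
    # delta(y): the largest function with P[x][y] >= delta(y) for every row x.
    delta = [min(P[x][y] for x in range(n)) for y in range(n)]
    mass = sum(delta)
    if not 0.0 < mass < 1.0:
        raise ValueError("the condition 0 < sum(delta) < 1 fails for this matrix")
    beta = 1.0 - mass
    # psi is a probability distribution with (1 - beta) * psi(y) = delta(y).
    psi = [d / mass for d in delta]
    P_prime = [[(P[x][y] - delta[y]) / beta for y in range(n)] for x in range(n)]
    return beta, psi, P_prime


def c_factor(gamma, beta):
    """C(gamma, beta) = gamma * (1 - beta) / (1 - gamma * beta)."""
    return gamma * (1.0 - beta) / (1.0 - gamma * beta)


P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
beta, psi, P_prime = split_kernel(P)
print(beta, psi, P_prime, c_factor(0.9, beta))
```

Since $1 - C(\gamma,\beta) = (1-\gamma)/(1-\gamma\beta)$, a smaller $\beta$ (more common mass $\delta$) brings the factor $[1 - C(\gamma,\beta)]^{-1}$ closer to one.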

3.2  Infinite  horizon  average  cost 

The receding horizon control is defined as follows. Given a finite horizon $H \ge 1$, we define the receding $H$-horizon control as a policy $\pi^*_H \in \Pi$ for the minimizer and a policy $\phi^*_H \in \Phi$ for the maximizer such that $\bar T_{\pi^*_H,\phi^*_H}(\bar V^*_{H-1})(x) = \bar T(\bar V^*_{H-1})(x)$ for all $x \in X$. We now present the performance error of the receding horizon control in terms of the infinite horizon average cost, compared with the equilibrium value, under our assumptions on Markov games. The analysis primarily builds on the work by Van der Wal [46]. We begin with a modified version of Van der Wal's Corollary 13.2 on page 230 of [46] within our context. For a function $v \in B(X)$, let the span semi-norm of $v$ be $\mathrm{sp}(v) = \max_x v(x) - \min_x v(x)$.

Theorem 3.3 Assume that Assumption 2.1 holds. For any $V \in B(X)$, consider two policies $\pi \in \Pi$ and $\phi \in \Phi$ such that $\bar T_{\pi,\phi}(V)(x) = \bar T(V)(x)$ for all $x \in X$. Then, for any $\pi' \in \Pi$ and $\phi' \in \Phi$,

$$J_\infty(\pi,\phi') \le J^*_\infty + \mathrm{sp}(\bar T(V) - V),$$

$$J_\infty(\pi',\phi) \ge J^*_\infty - \mathrm{sp}(\bar T(V) - V).$$

From now on, we will set $|X| = s$ (we naturally assume that $s > 1$). Under the aperiodicity assumption (the second part of Assumption 2.1), there exists a constant $\eta$, with $0 \le \eta < 1$, such that the following scrambling condition holds: for any $\pi, \pi' \in \Pi$ and any $\phi, \phi' \in \Phi$ and for all $x, y \in X$,

$$\sum_{z\in X}\min\{P^{s-1}_{xz}(\pi,\phi),\ P^{s-1}_{yz}(\pi',\phi')\} \ge 1 - \eta,$$

where $P^{s-1}_{xz}(\pi,\phi)$ denotes the probability that the initial state $x$ will reach the state $z$ in $s-1$ time steps under the policies $\pi$ and $\phi$. We will refer to $\eta$ as an ergodicity coefficient.
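As a rough illustration of the ergodicity coefficient, the sketch below computes it for one fixed stationary policy pair, so that the game reduces to a single Markov chain. It assumes the standard form of the scrambling condition, $\sum_z \min\{P^{s-1}_{xz}, P^{s-1}_{yz}\} \ge 1-\eta$; in the text the condition must hold across all policy pairs, so the value computed here is only a lower estimate of the $\eta$ needed there. The function names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def ergodicity_coefficient(P):
    """eta = 1 - min over (x, y) of sum_z min(Q[x][z], Q[y][z]), with Q = P^(s-1)."""
    s = len(P)
    Q = [row[:] for row in P]
    for _ in range(s - 2):          # after the loop, Q = P^(s-1)
        Q = mat_mul(Q, P)
    overlap = min(sum(min(Q[x][z], Q[y][z]) for z in range(s))
                  for x in range(s) for y in range(s))
    return 1.0 - overlap


print(ergodicity_coefficient([[0.5, 0.5], [0.2, 0.8]]))
```

A chain whose rows are identical has $\eta = 0$, while a chain with two rows of disjoint support after $s-1$ steps has $\eta = 1$ and violates the condition.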

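Lemma 3.3 Assume that Assumption 2.1 holds. For $n = 0, 1, \ldots$, $\mathrm{sp}(\bar V^*_{n+1} - \bar V^*_n) \le 2\eta^{\frac{n}{s-1}} C_{\max}$.

Proof: Van der Wal showed that (see page 235 in [46]) $\mathrm{sp}(\bar V^*_{n+s} - \bar V^*_{n+s-1}) \le \eta \cdot \mathrm{sp}(\bar V^*_{n+1} - \bar V^*_n)$, $n = 0, 1, \ldots$, which implies that

$$\mathrm{sp}(\bar V^*_{n+1} - \bar V^*_n) \le \eta^{\frac{n}{s-1}}\,\mathrm{sp}(\bar V^*_1 - \bar V^*_0) \quad \text{for } n = 0, 1, \ldots$$

Since $\mathrm{sp}(\bar V^*_1 - \bar V^*_0) \le 2C_{\max}$ with $\bar V^*_0 = 0$, we have the desired result. ■

The theorem and the lemma above yield immediately the following result.

Theorem 3.4 Assume that Assumption 2.1 holds. Consider the receding $H$-horizon control $\pi^*_H \in \Pi$ for the minimizer and $\phi^*_H \in \Phi$ for the maximizer such that $\bar T_{\pi^*_H,\phi^*_H}(\bar V^*_{H-1})(x) = \bar T(\bar V^*_{H-1})(x)$ for all $x \in X$. Then,

$$|J_\infty(\pi^*_H,\phi^*_H) - J^*_\infty| \le 2\eta^{\frac{H-1}{s-1}} C_{\max}.$$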

We can see again that the receding horizon control for the average cost case also gives a good approximation of the infinite horizon equilibrium policy for each player, and that the value of the game under these policies approaches the equilibrium performance for the infinite horizon average cost geometrically in the ergodicity coefficient $\eta$. Furthermore, by letting $2\eta^{\frac{H-1}{s-1}} C_{\max} = \epsilon$, we can obtain the necessary value of the planning horizon which guarantees that the performance of the receding horizon control will be within $\epsilon$ of the equilibrium value.
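Solving $2\eta^{(H-1)/(s-1)} C_{\max} \le \epsilon$ for $H$ gives $H \ge 1 + (s-1)\log_\eta(\epsilon / 2C_{\max})$. A minimal sketch of that calculation follows; the function names are hypothetical, and the bound is taken in the form used in this subsection, assuming $0 < \eta < 1$ and $s > 1$.

```python
import math


def average_cost_error_bound(H, eta, s, c_max):
    """2 * eta^((H-1)/(s-1)) * C_max for a receding horizon H."""
    return 2.0 * eta ** ((H - 1) / (s - 1)) * c_max


def average_cost_horizon(eps, eta, s, c_max):
    """An integer H >= 1 whose bound is at most eps (assumes 0 < eta < 1, s > 1)."""
    if eps >= 2.0 * c_max:
        return 1
    H = 1 + math.ceil((s - 1) * math.log(eps / (2.0 * c_max)) / math.log(eta))
    # Guard against floating-point rounding in the logarithms.
    while average_cost_error_bound(H, eta, s, c_max) > eps:
        H += 1
    return H


H = average_cost_horizon(eps=0.1, eta=0.5, s=3, c_max=1.0)
print(H, average_cost_error_bound(H, 0.5, 3, 1.0))
```

The factor $s-1$ in front of the logarithm is what makes the required horizon grow with the size of the state space.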

An error bound when the maximizer plays the true worst-case scenario $\phi^*$ is also obtained directly from Theorem 3.3.

Theorem 3.5 Assume that Assumption 2.1 holds. Consider the receding $H$-horizon control $\pi^*_H \in \Pi$ for the minimizer such that $\bar T_{\pi^*_H}(\bar V^*_{H-1})(x) = \bar T(\bar V^*_{H-1})(x)$ for all $x \in X$. Then,

$$0 \le J_\infty(\pi^*_H,\phi^*) - J^*_\infty \le 2\eta^{\frac{H-1}{s-1}} C_{\max}.$$

The error bounds we presented above vanish geometrically as the horizon increases to infinity. However, they depend on the size of the state space: if $s$ is a huge number, the error bound will be large for a relatively small $H$. We now add a new condition on the transition probability matrix so that we can eliminate the dependence on the size of the state space.

Assumption 3.1 There exists a nonnegative function $\mu \in B(X)$ such that for some constant $\alpha$, with $0 \le \alpha < 1$,

$$\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,\mu(y) \le \alpha\,\mu(x)$$

for all $x \in X$, $\pi \in \Pi$ and $\phi \in \Phi$.

We will refer to this assumption as the $\mu$-recurrent condition [17].

We define the $\mu$-norm of a function $v \in B(X)$, $\|v\|_\mu$, by

$$\|v\|_\mu = \inf\{c \in \mathbb{R} : |v(x)| \le c\,\mu(x),\ \forall x \in X\}.$$

It is well-known that under the $\mu$-recurrent condition, $\bar T$ is a contraction mapping with respect to the $\mu$-norm. That is, for any $v, w \in B(X)$,

$$\|\bar T(v) - \bar T(w)\|_\mu \le \alpha\,\|v - w\|_\mu.$$

Furthermore, it can then be easily proven (see, e.g., page 199 in [46]) that for any $x \in X$ and for any $v, w \in B(X)$,

$$-\alpha\,\mu(x)\,\|v - w\|_\mu \le \bar T(v)(x) - \bar T(w)(x) \le \alpha\,\mu(x)\,\|v - w\|_\mu.$$

It follows that for $n = 1, 2, \ldots$,

$$\mathrm{sp}(\bar V^*_{n+1} - \bar V^*_n) \le 2\alpha \max_x \mu(x)\,\|\bar V^*_n - \bar V^*_{n-1}\|_\mu \le 2\alpha^n \max_x \mu(x)\,\|\bar V^*_1 - \bar V^*_0\|_\mu = 2\alpha^n \max_x \mu(x)\,\|\bar V^*_1\|_\mu.$$

Because $\|\bar V^*_1\|_\mu \le C_{\max}/(1-\alpha)$ (see, e.g., page 199 in [46]), we have the following immediate result with the $\mu$-recurrent condition.

Proposition 3.1 Assume that Assumptions 2.1 and 3.1 hold. Consider the receding $H$-horizon control $\pi^*_H \in \Pi$ for the minimizer and $\phi^*_H \in \Phi$ for the maximizer such that $\bar T_{\pi^*_H,\phi^*_H}(\bar V^*_{H-1})(x) = \bar T(\bar V^*_{H-1})(x)$ for all $x \in X$. Then under the $\mu$-recurrent condition,

$$|J_\infty(\pi^*_H,\phi^*_H) - J^*_\infty| \le \frac{\alpha^{H-1}}{1-\alpha}\cdot 2C_{\max}\max_x \mu(x),$$

$$0 \le J_\infty(\pi^*_H,\phi^*) - J^*_\infty \le \frac{\alpha^{H-1}}{1-\alpha}\cdot 2C_{\max}\max_x \mu(x).$$

Therefore, the above proposition establishes the geometric convergence of the receding horizon control independently of the state space size. To apply the receding horizon control, we need to know the exact value of the finite horizon subgames. In practice, however, obtaining the true $(H-1)$-horizon equilibrium value, which the minimizer needs in order to get the receding $H$-horizon control policy, is also troublesome if the state space is huge. Motivated by this, we now analyze the approximate receding horizon control.

3.3  Analysis  of  approximate  receding  horizon  control 
3.3.1  Infinite  horizon  discounted  cost 

We start with lemmas to state our main result for the approximate receding horizon control.

Lemma 3.4 For all $x \in X$ and $n = 0, 1, \ldots$,

$$|V^*_{n+1}(x) - V^*_n(x)| \le \frac{\gamma^n}{1-\gamma}\cdot 2C_{\max}.$$

Proof: This is directly obtained from the contraction mapping property. ■

The theorem below states an error bound from the equilibrium value of the game when both the minimizer and the maximizer play the receding horizon control based on the same approximate value, i.e., correlated equilibrium policies.

Theorem 3.6 Given $V \in B(X)$ such that for some $n \ge 0$, $|V^*_n(x) - V(x)| \le \epsilon$ for all $x$ in $X$, consider a policy $\pi$ for the minimizer and $\phi$ for the maximizer such that for all $x \in X$, $T_{\pi,\phi}(V)(x) = T(V)(x)$. Then, for all $x \in X$,

$$|V^*_\infty(x) - V_\infty(\pi,\phi)(x)| \le \frac{\gamma^{n+1}(2-\gamma)}{(1-\gamma)^2}\cdot 2C_{\max} + \frac{2\gamma\epsilon}{1-\gamma}.$$

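Proof: From the contraction mapping property of the $T$ operator, for all $x$ in $X$,

$$|T(V^*_n)(x) - T(V)(x)| \le \gamma \cdot \max_x |V^*_n(x) - V(x)| \le \gamma\epsilon \qquad (4)$$

and from $T(V^*_n) = V^*_{n+1}$ and a successive application of the contraction property we have

$$\max_x |V^*_\infty(x) - V^*_{n+1}(x)| \le \gamma^{n+1}\max_x |V^*_\infty(x) - V^*_0(x)| \le \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max}. \qquad (5)$$

Therefore, from Equations (4) and (5) and $V^*_{n+1} = T(V^*_n)$ by definition, for all $x \in X$,

$$|V^*_\infty(x) - T(V)(x)| \le |V^*_\infty(x) - T(V^*_n)(x)| + |T(V^*_n)(x) - T(V)(x)| \le \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max} + \gamma\epsilon. \qquad (6)$$

Below we show that $|T(V)(x) - V_\infty(\pi,\phi)(x)| \le \frac{\gamma\epsilon(1+\gamma)}{1-\gamma} + \frac{\gamma^{n+1}}{(1-\gamma)^2}\cdot 2C_{\max}$ for all $x \in X$. It then follows from Equation (6) that, for all $x \in X$,

$$\begin{aligned}
|V^*_\infty(x) - V_\infty(\pi,\phi)(x)| &\le |V^*_\infty(x) - T(V)(x)| + |T(V)(x) - V_\infty(\pi,\phi)(x)| \\
&\le \frac{2C_{\max}}{1-\gamma}\,\gamma^{n+1} + \gamma\epsilon + \frac{\gamma\epsilon(1+\gamma)}{1-\gamma} + \frac{\gamma^{n+1}}{(1-\gamma)^2}\cdot 2C_{\max} \\
&= \frac{\gamma^{n+1}(2-\gamma)}{(1-\gamma)^2}\cdot 2C_{\max} + \frac{2\gamma\epsilon}{1-\gamma},
\end{aligned}$$

which gives the desired result.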

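From Lemma 3.4 and Equation (4), we have that for all $x \in X$, by letting $w = \frac{\gamma^n}{1-\gamma}\cdot 2C_{\max}$,

$$V(x) \le V^*_n(x) + \epsilon \le V^*_{n+1}(x) + \epsilon + w \le T(V)(x) + \gamma\epsilon + \epsilon + w. \qquad (7)$$

Then for all $x \in X$,

$$\begin{aligned}
T(V)(x) &= T_{\pi,\phi}(V)(x) = C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,V(y) && \text{by definitions of } \pi,\ \phi \text{ and } T \\
&\le C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,[T(V)(y) + \gamma\epsilon + \epsilon + w] && \text{by Equation (7)} \\
&= C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,T(V)(y) + \gamma\epsilon(1+\gamma) + \gamma w \\
&= C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\Big(C_y(\pi(y),\phi(y)) + \gamma\sum_{z\in X} P_{yz}(\pi(y),\phi(y))\,V(z)\Big) + \gamma\epsilon(1+\gamma) + \gamma w \\
&= C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,C_y(\pi(y),\phi(y)) \\
&\quad + \gamma^2\sum_{y\in X}\sum_{z\in X} P_{xy}(\pi(x),\phi(x))\,P_{yz}(\pi(y),\phi(y))\,V(z) + \gamma\epsilon(1+\gamma) + \gamma w \\
&\le C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,C_y(\pi(y),\phi(y)) \\
&\quad + \gamma^2\sum_{y\in X}\sum_{z\in X} P_{xy}(\pi(x),\phi(x))\,P_{yz}(\pi(y),\phi(y))\,T(V)(z) + \gamma^2\epsilon(1+\gamma) + \gamma\epsilon(1+\gamma) + (\gamma^2 w + \gamma w).
\end{aligned}$$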

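Iterating in this way (under the sum sign), we have that for all $k = 0, 1, \ldots$, and $x \in X$,

$$\begin{aligned}
T(V)(x) \le{}& E\Big[\sum_{t=0}^{k}\gamma^t C_{x_t}(\pi(x_t),\phi(x_t)) \,\Big|\, x_0 = x\Big] + \gamma^{k+1} E[T(V)(x_{k+1}) \mid x_0 = x] \\
&+ \gamma\epsilon(1+\gamma) + \cdots + \gamma^{k+1}\epsilon(1+\gamma) + (\gamma w + \cdots + \gamma^{k+1} w), \qquad (8)
\end{aligned}$$

where $x_t$ is the random variable representing the state at time $t$ under $\pi$ and $\phi$. Since $T(V)$ is bounded, the second term on the r.h.s. of Equation (8) converges to zero as $k \to \infty$ and the first term becomes $V_\infty(\pi,\phi)(x)$. Therefore it follows that $T(V)(x) - V_\infty(\pi,\phi)(x) \le \frac{\gamma\epsilon(1+\gamma) + \gamma w}{1-\gamma}$, that is, $T(V)(x) - V_\infty(\pi,\phi)(x) \le \frac{\gamma\epsilon(1+\gamma)}{1-\gamma} + \frac{\gamma^{n+1}}{(1-\gamma)^2}\cdot 2C_{\max}$ for all $x \in X$.

Similarly, we can show that $T(V)(x) - V_\infty(\pi,\phi)(x) \ge -\frac{\gamma\epsilon(1+\gamma)}{1-\gamma} - \frac{\gamma^{n+1}}{(1-\gamma)^2}\cdot 2C_{\max}$ for all $x \in X$ by the observation that from the assumption and Equation (4), we have that for all $x \in X$,

$$V(x) \ge V^*_n(x) - \epsilon \ge V^*_{n+1}(x) - \epsilon - w \ge T(V)(x) - \gamma\epsilon - \epsilon - w.$$

■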


From the approximate receding horizon control framework, given an approximate function $V$, the minimizer will play the policy $\pi$ such that $T_{\pi,\phi}(V) = T(V)$ at each $x \in X$. That is, he will assume that the maximizer will play the correlated equilibrium policy with respect to $V$. We now present the value of the game when the maximizer actually plays the worst-case scenario.

Theorem 3.7 Suppose $V^*_0$ is selected such that for all $x \in X$, $T(V^*_0)(x) \le V^*_0(x)$. Given $V \in B(X)$ such that for some $n \ge 0$, $|V^*_n(x) - V(x)| \le \epsilon$ for all $x$ in $X$, consider a policy $\pi$ for the minimizer such that for all $x \in X$, $T_\pi(V)(x) = T(V)(x)$. Then, for all $x \in X$,

$$0 \le V_\infty(\pi,\phi^*)(x) - V^*_\infty(x) \le \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max} + \frac{2\gamma\epsilon}{1-\gamma}.$$


Before  we  provide  a  proof  of  this  theorem,  we  mention  here  that  setting  e  =  0  with  n  =  H  —  1 
gives  exactly  the  bound  of  Theorem  3.2.  Even  though  we  could  have  obtained  the  result  for 
Theorem  3.2  by  setting  e  =  0  with  n  =  H  —  1  here,  we  wanted  to  show  that  there  is  an  alternate 
but  simpler  proof  than  the  proof  below. 


Proof: The lower bound is trivially true, so we prove the upper bound. The proof technique is quite similar to the proof of the previous theorem.

For all $x \in X$, $V_\infty(\pi,\phi^*)(x) - V^*_\infty(x) = V_\infty(\pi,\phi^*)(x) - T(V)(x) + T(V)(x) - V^*_\infty(x)$. We have that $T(V)(x) - V^*_\infty(x) \le \gamma\epsilon + \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max}$ (see the proof of the previous theorem). It remains to show that $V_\infty(\pi,\phi^*)(x) - T(V)(x) \le \frac{\gamma\epsilon(1+\gamma)}{1-\gamma}$.
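To see how the two pieces of this proof add up to the bound stated in Theorem 3.7, the arithmetic can be written out as follows. This is a worked step that takes the two partial bounds in the form $T(V)(x) - V^*_\infty(x) \le \gamma\epsilon + \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max}$ and $V_\infty(\pi,\phi^*)(x) - T(V)(x) \le \frac{\gamma\epsilon(1+\gamma)}{1-\gamma}$, as read from the surrounding argument:

```latex
\begin{align*}
V_\infty(\pi,\phi^*)(x) - V^*_\infty(x)
  &\le \frac{\gamma\epsilon(1+\gamma)}{1-\gamma} + \gamma\epsilon
       + \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max} \\
  &= \frac{\gamma\epsilon(1+\gamma) + \gamma\epsilon(1-\gamma)}{1-\gamma}
       + \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max} \\
  &= \frac{2\gamma\epsilon}{1-\gamma} + \frac{\gamma^{n+1}}{1-\gamma}\cdot 2C_{\max}.
\end{align*}
```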

Now, for all $x \in X$, $-\gamma\epsilon + T(V)(x) \le V^*_{n+1}(x) \le V^*_n(x) \le V(x) + \epsilon$, where the first inequality is from Equation (4), the second inequality is from Lemma 3.2, and the third inequality is from the assumption. It follows that

$$\begin{aligned}
T(V)(x) = T_\pi(V)(x) &= \sup_{f\in F(x)}\Big[C_x(\pi(x),f) + \gamma\sum_{y\in X} P_{xy}(\pi(x),f)\,V(y)\Big] \\
&\ge C_x(\pi(x),\phi^*(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi^*(x))\,V(y) \\
&\ge C_x(\pi(x),\phi^*(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi^*(x))\,[T(V)(y) - \epsilon(1+\gamma)].
\end{aligned}$$

Iterating in this way (under the sum sign), we have that for all $k = 0, 1, \ldots$, and $x \in X$,

$$\begin{aligned}
T(V)(x) \ge{}& E\Big[\sum_{t=0}^{k}\gamma^t C_{x_t}(\pi(x_t),\phi^*(x_t)) \,\Big|\, x_0 = x\Big] + \gamma^{k+1} E[T(V)(x_{k+1}) \mid x_0 = x] \\
&- [\gamma\epsilon(1+\gamma) + \cdots + \gamma^{k+1}\epsilon(1+\gamma)], \qquad (9)
\end{aligned}$$

where $x_t$ is the random variable representing the state at time $t$ under $\pi$ and $\phi^*$. Since $T(V)$ is bounded, the second term on the r.h.s. of Equation (9) converges to zero as $k \to \infty$ and the first term becomes $V_\infty(\pi,\phi^*)(x)$. Therefore it follows that $T(V)(x) - V_\infty(\pi,\phi^*)(x) \ge -\frac{\gamma\epsilon(1+\gamma)}{1-\gamma}$. ■

As we have studied in subsection 3.1, if there exists a function $\delta$ defined on $X$ such that $0 < \sum_{x\in X}\delta(x) < 1$ and $P_{xy}(f,g) \ge \delta(y)$ for all $x,y,g,f$, the error bounds above can be improved. Let $\beta = 1 - \sum_{x\in X}\delta(x)$ again. We only present the upper bound case of Theorem 3.6 as an example.

With the same arguments given in subsection 3.1,

$$\max_x [V^*_\infty(x) - V_\infty(\pi,\phi)(x)] \le [1 - C(\gamma,\beta)]^{-1}\max_x [V'_\infty(x) - V'_\infty(\pi,\phi)(x)].$$


We  have 

max[K^(x)  -  K^(7r,  (p){x)]  <  + 

if  V{x)  —  V)((x)  <  e'.  But  e'  =  e  +  where  Cn  is  given  in  Equation  (3). 


1-7/3 


3.3.2  Infinite  horizon  average  cost 

Theorem 3.8 Assume that Assumption 2.1 holds. Given $V \in B(X)$ such that for some $n \ge 0$, $|\bar V^*_n(x) - V(x)| \le \epsilon$ for all $x \in X$, consider a policy $\pi$ for the minimizer and $\phi$ for the maximizer such that for all $x \in X$, $\bar T_{\pi,\phi}(V)(x) = \bar T(V)(x)$. Then,

$$|J_\infty(\pi,\phi) - J^*_\infty| \le 2\eta^{\frac{n}{s-1}} C_{\max} + 4\epsilon,$$

$$0 \le J_\infty(\pi,\phi^*) - J^*_\infty \le 2\eta^{\frac{n}{s-1}} C_{\max} + 4\epsilon.$$

Proof: From the assumption, $-\epsilon \le \bar V^*_n(x) - V(x) \le \epsilon$ for all $x \in X$. Applying the $\bar T$-operator to each side and using the monotonicity property, we have $-\epsilon \le \bar T(\bar V^*_n)(x) - \bar T(V)(x) \le \epsilon$ for all $x \in X$. Therefore we have that

$$\mathrm{sp}(\bar T(V) - V) \le \mathrm{sp}(\bar V^*_{n+1} - \bar V^*_n) + 4\epsilon.$$

Applying Theorem 3.3 together with Lemma 3.3, we have the result. The error bound on the value of the game when the maximizer actually plays the worst-case scenario is also directly obtained from Theorem 3.3. ■

We remark that we can add the $\mu$-recurrent condition (Assumption 3.1) to this case also so that we can eliminate the dependence on the state space size as we did previously.

4  Examples  of  Approximate  Receding  Horizon  Control

In  this  section,  we  introduce  three  approaches  as  examples  of  approximate  receding  horizon  control 
for  the  Markov  games.  These  are  heuristics  for  the  minimizer  who  seeks  to  optimize  his  performance 
under  the  guess  of  the  worst-case  scenario  from  the  opponent’s  play.  The  first  two  approaches  (to 
the  minimizer)  aim  at  improving  a  given  heuristic  policy  (or  a  set  of  multiple  heuristic  policies) 
that  is  available  to  the  minimizer,  based  on  the  policy  improvement  arguments.  The  final  approach 
is  motivated  by  hindsight  optimization  proposed  in  [13,  16].  By  this  approach,  at  each  state,  the 
minimizer  evaluates  his  candidate  randomized  actions  based  on  the  analysis  of  the  expected  optimal 
hindsight  performance  over  a  finite  horizon  under  the  guess  that  the  maximizer  plays  the  worst-case 
fixed  policy  chosen  by  the  minimizer. 

4.1  Rollout  algorithm 

Our discussion in this subsection will focus on the discounted case first and then consider the average case. For the minimizer, obtaining an equilibrium policy is often quite difficult due to the curse of dimensionality. One approach to take when a heuristic policy is available to the minimizer is to assume that the maximizer has chosen a fixed policy $\phi \in \Phi$ to play the given Markov game and then to try to improve the heuristic policy of the minimizer. Because it is also difficult for the minimizer to get the worst-case policy (the equilibrium policy for the maximizer), the minimizer will need to choose a heuristic worst-case policy for the maximizer. For some cases, we can actually get $\phi^*$ (see, e.g., [1] and the references therein). If we fix the maximizer's policy, the resulting game becomes a Markov decision process to the minimizer. It is well-known from the policy improvement principle that given a policy $\pi$, if we define a new policy $\pi_{ro}$ such that $T_{\pi_{ro},\phi}(V_\infty(\pi,\phi))(x) = T_\phi(V_\infty(\pi,\phi))(x)$ for all $x \in X$, the new policy $\pi_{ro}$ improves the policy $\pi$ in terms of the infinite horizon discounted cost. That is, $V_\infty(\pi_{ro},\phi) \le V_\infty(\pi,\phi)$. Because this holds for arbitrary $\phi \in \Phi$, $\pi_{ro}$ dominates $\pi$.

Several works on MDP problems (with their related cost functions) in this respect have reported successful results. For example, Bertsekas and Castanon considered stochastic scheduling problems [9], Secomandi [43] studied a vehicle routing problem, Ott and Krishnan [37] and Kolarov and Hui [30] studied network routing problems, Bhulai and Koole [11] considered a multi-server queueing problem, and Koole and Nain [32] considered a two-class single-server queueing model under a preemptive priority rule. In particular, [11] and [32] obtained explicit expressions for the value function of a fixed threshold policy, which plays the role of a heuristic base policy, and showed numerically that the rollout of the policy behaves almost optimally. Chang et al. [14] also showed empirically that the rollout of a fixed threshold policy (Droptail) works well for a buffer management problem. Koole [31] also derived the deviation matrix of the M/M/1/$\infty$ and M/M/1/N queues, which is used for computing the bias vector for a particular choice of cost function and a certain base policy, from which the rollout policy of the base policy is generated. Note that in queueing systems viewed as Markov games, we can consider a worst-case arrival process and then analyze the value function of a certain fixed policy with the worst-case arrival process, from which we generate a rollout policy for the minimizer.

As a receding horizon approach for this improvement scheme, we replace $V_\infty(\pi,\phi)$ by the value of the game when the policies $\pi$ and $\phi$ are followed over a finite horizon. Formally, we define the $H$-horizon rollout policy $\pi_{ro,H}$ with a base policy $\pi$ to be a policy $\pi_{ro,H}$ that satisfies $T_{\pi_{ro,H},\phi}(V_{H-1}(\pi,\phi))(x) = T_\phi(V_{H-1}(\pi,\phi))(x)$ for all $x \in X$, where $V_{H-1}(\pi,\phi)(x) := E\{\sum_{t=0}^{H-2}\gamma^t C_{x_t}(\pi(x_t),\phi(x_t)) \mid x_0 = x\}$.
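The definition can be sketched in a few lines once the maximizer's policy $\phi$ is fixed and folded into the model, so that the minimizer faces an ordinary MDP. The sketch below is a simplification under stated assumptions: the minimizer uses pure rather than randomized actions, `cost[x][a]` and `trans[x][a]` are hypothetical containers for $C_x(a,\phi(x))$ and $P_{x\cdot}(a,\phi(x))$, and the terminal value `v0` plays the role of $V_0(\pi,\phi)$.

```python
def evaluate(policy, cost, trans, gamma, horizon, v0):
    """Cost of following `policy` for `horizon` steps with terminal value v0."""
    v = list(v0)
    n = len(cost)
    for _ in range(horizon):
        v = [cost[x][policy[x]]
             + gamma * sum(p * v[y] for y, p in enumerate(trans[x][policy[x]]))
             for x in range(n)]
    return v


def rollout_policy(base, cost, trans, gamma, H, v0):
    """Minimizing action w.r.t. the (H-1)-horizon value of the base policy."""
    v = evaluate(base, cost, trans, gamma, H - 1, v0)

    def q(x, a):
        return cost[x][a] + gamma * sum(p * v[y] for y, p in enumerate(trans[x][a]))

    return [min(range(len(cost[x])), key=lambda a: q(x, a))
            for x in range(len(cost))]


# Two states, two actions; action 1 in state 0 pays 2 to reach the free state 1.
cost = [[1.0, 2.0], [0.0, 3.0]]
trans = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
base = [0, 0]
ro = rollout_policy(base, cost, trans, gamma=0.9, H=3, v0=[30.0, 30.0])
print(ro, evaluate(ro, cost, trans, 0.9, 500, [0.0, 0.0]))
```

Here `v0 = [30.0, 30.0]` is $C_{\max}/(1-\gamma)$ with $C_{\max} = 3$, one choice for which the terminal value does not increase under the base policy.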

We present the result regarding the $H$-horizon rollout policy adapted from [13] and provide the proof for completeness. We first begin with a lemma similar to Lemma 3.2.

Lemma 4.1 Suppose $V_0(\pi,\phi)$ is selected such that for all $x \in X$, $T_{\pi,\phi}(V_0(\pi,\phi))(x) \le V_0(\pi,\phi)(x)$. Then, for $H = 1, 2, \ldots$, and for all $x \in X$, $V_H(\pi,\phi)(x) \le V_{H-1}(\pi,\phi)(x)$.

Proof: The statement can be proven by induction on $H$ as in the proof of Lemma 3.2. ■

Proposition 4.1 Given a fixed policy $\phi \in \Phi$ and a base policy $\pi \in \Pi$ for the minimizer, suppose $V_0(\pi,\phi)$ is selected such that $T_{\pi,\phi}(V_0(\pi,\phi))(x) \le V_0(\pi,\phi)(x)$ for all $x \in X$. For any $\epsilon > 0$, if $H \ge 1 + \log_\gamma \frac{\epsilon(1-\gamma)}{C_{\max}}$, then for all $x \in X$, $V_\infty(\pi_{ro,H},\phi)(x) \le V_\infty(\pi,\phi)(x) + \epsilon$.

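Proof: Define $\psi = V_{H-1}(\pi,\phi)$. By definition of the rollout policy,

$$\begin{aligned}
T_{\pi_{ro,H},\phi}(\psi)(x) = T_\phi(\psi)(x) &= C_x(\pi_{ro,H}(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi_{ro,H}(x),\phi(x))\,\psi(y) \\
&\le C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,\psi(y) = V_H(\pi,\phi)(x) \le \psi(x),
\end{aligned}$$

where the last inequality follows from Lemma 4.1. Therefore, for all $x \in X$, we have $V_\infty(\pi_{ro,H},\phi)(x) \le V_{H-1}(\pi,\phi)(x)$ by Lemma 3.1. Now we can write for all $x \in X$, $V_\infty(\pi,\phi)(x) = V_{H-1}(\pi,\phi)(x) + \gamma^{H-1}E[V_\infty(\pi,\phi)(x_{H-1}) \mid x_0 = x]$. We know that $\min_x [V_\infty(\pi,\phi)(x)] \ge -\frac{C_{\max}}{1-\gamma}$. This implies that $V_\infty(\pi_{ro,H},\phi)(x) \le V_\infty(\pi,\phi)(x) + \frac{C_{\max}}{1-\gamma}\cdot\gamma^{H-1}$. Letting $\frac{C_{\max}}{1-\gamma}\cdot\gamma^{H-1} \le \epsilon$ yields the desired result. ■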
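The horizon condition follows from solving $\frac{C_{\max}}{1-\gamma}\gamma^{H-1} \le \epsilon$ for $H$. A small sketch of that calculation is given below; the function name is hypothetical and $0 < \gamma < 1$ is assumed.

```python
import math


def rollout_horizon(eps, gamma, c_max):
    """An integer H >= 1 with c_max / (1 - gamma) * gamma^(H-1) <= eps."""
    target = eps * (1.0 - gamma) / c_max
    if target >= 1.0:
        return 1
    H = 1 + math.ceil(math.log(target) / math.log(gamma))
    # Guard against floating-point rounding in the logarithms.
    while c_max / (1.0 - gamma) * gamma ** (H - 1) > eps:
        H += 1
    return H


print(rollout_horizon(eps=0.1, gamma=0.9, c_max=1.0))
```

Unlike the average cost case, this horizon does not depend on the number of states, only on $\gamma$, $C_{\max}$ and $\epsilon$.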

We note again that the minimizer is assuming the maximizer's play. If the minimizer's guess of the worst-case scenario is good in the sense that $\max_x |V_{H-1}(\pi,\phi)(x) - V^*_{H-1}(x)| \le \epsilon$ with a relatively small value of $\epsilon$, the resulting performance will be bounded by Theorem 3.7 from the optimal equilibrium performance.

The average case is similar to the discounted case except that we define the rollout policy with respect to the "$\bar T$"-operators: the rollout policy $\pi_{ro,H}$ is defined as a policy such that $\bar T_{\pi_{ro,H},\phi}(\bar V_{H-1}(\pi,\phi))(x) = \bar T_\phi(\bar V_{H-1}(\pi,\phi))(x)$ for all $x \in X$, where $\bar V_{H-1}$ is obtained with $\gamma = 1$ in $V_{H-1}$ and we assume that $\bar V_0(\pi,\phi)$ is the zero function. The principle behind this is also the policy improvement scheme (see, e.g., [26]) under the assumptions we made for the average Markov games, i.e., aperiodicity and irreducibility.

Proposition 4.2 Assume that Assumption 2.1 holds. Consider the $H$-horizon rollout policy $\pi_{ro,H}$ with a base policy $\pi$ with respect to $\phi \in \Phi$. Then

$$J_\infty(\pi_{ro,H},\phi) \le J_\infty(\pi,\phi) + 2\eta^{\frac{H-1}{s-1}} C_{\max}.$$

To  prove  the  above  proposition,  we  start  with  a  lemma,  which  can  be  proven  by  the  invariance 
property  [23]  of  the  stationary  distribution  of  the  underlying  Markov  chain  (see,  e.g.,  [15]).  Note 
that  under  our  assumptions,  there  exists  a  stationary  distribution  over  X  under  any  policy  pair. 

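Lemma 4.2 For any $\pi \in \Pi$ and $\phi \in \Phi$, a stationary distribution $P^{\pi,\phi}$ over $X$ exists, and for all $n = 0, 1, \ldots$,

$$J_\infty(\pi,\phi) = \sum_{y\in X}[\bar V_{n+1}(\pi,\phi)(y) - \bar V_n(\pi,\phi)(y)]\,P^{\pi,\phi}(y).$$

In particular, given $V \in B(X)$ and $\phi \in \Phi$, if $\pi$ is defined such that $\bar T_{\pi,\phi}(V)(x) = \bar T_\phi(V)(x)$ for all $x \in X$, then

$$J_\infty(\pi,\phi) = \sum_{y\in X}[\bar T_\phi(V)(y) - V(y)]\,P^{\pi,\phi}(y).$$

Lemma 4.3 For $n = 0, 1, \ldots$, and any $\pi \in \Pi$ and $\phi \in \Phi$,

$$\max_x [\bar V_{n+1}(\pi,\phi)(x) - \bar V_n(\pi,\phi)(x)] \le J_\infty(\pi,\phi) + 2\eta^{\frac{n}{s-1}} C_{\max}.$$

Proof: As in the statement of Lemma 3.3, we can show that for $n = 0, 1, \ldots$, $\mathrm{sp}(\bar V_{n+1}(\pi,\phi) - \bar V_n(\pi,\phi)) \le 2\eta^{\frac{n}{s-1}} C_{\max}$ by reasoning similar to that given on pages 234-235 of [46]. By Lemma 4.2,

$$\min_x [\bar V_{n+1}(\pi,\phi)(x) - \bar V_n(\pi,\phi)(x)] \le J_\infty(\pi,\phi) \le \max_x [\bar V_{n+1}(\pi,\phi)(x) - \bar V_n(\pi,\phi)(x)].$$

It follows that $\max_x [\bar V_{n+1}(\pi,\phi)(x) - \bar V_n(\pi,\phi)(x)] - J_\infty(\pi,\phi) \le \mathrm{sp}(\bar V_{n+1}(\pi,\phi) - \bar V_n(\pi,\phi))$. Therefore the result follows. ■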


We are now ready to prove the proposition above.

Proof:

$$\begin{aligned}
J_\infty(\pi_{ro,H},\phi) &= \sum_{x}[\bar T_\phi(\bar V_{H-1}(\pi,\phi))(x) - \bar V_{H-1}(\pi,\phi)(x)]\,P^{\pi_{ro,H},\phi}(x) && \text{from Lemma 4.2} \\
&\le \max_x (\bar T_\phi(\bar V_{H-1}(\pi,\phi))(x) - \bar V_{H-1}(\pi,\phi)(x)) \\
&\le \max_x (\bar V_H(\pi,\phi)(x) - \bar V_{H-1}(\pi,\phi)(x)) \\
&\le J_\infty(\pi,\phi) + 2\eta^{\frac{H-1}{s-1}} C_{\max} && \text{from Lemma 4.3.}
\end{aligned}$$

■

Therefore, if $H \ge 1 + (s-1)\log_\eta \frac{\epsilon}{2C_{\max}}$, the rollout policy dominates the heuristic base policy by $\epsilon$. By adding the $\mu$-recurrent condition, a similar result can be obtained.

4.2  Parallel  rollout 

When  a  good  heuristic  policy  is  available  to  the  minimizer  and  a  fixed  worst-case  policy  can  be 
assumed  for  the  maximizer,  the  performance  of  the  rollout  policy  played  by  the  minimizer  will 
be  promising  because  it  will  improve  the  performance  of  the  heuristic  policy  for  the  minimizer. 
However,  often  getting  a  good  heuristic  policy  to  roll  out  is  very  difficult.  This  will  be  particularly 
true  for  the  case  where  for  some  trajectories  of  the  states,  a  heuristic  policy  is  good  and  for  other 
trajectories  of  the  states,  another  heuristic  policy  is  good,  etc.  As  a  simple  example,  for  a  multiclass 
scheduling  problem  where  the  cost  is  a  function  of  the  delay  and  the  (importance)  weight  of  the 
class,  the  performances  of  the  static  priority  policy  and  the  earliest  deadline  first  policy  depend  on 
the  system  trajectories  (see  [13]  for  a  detailed  discussion). 

As  a  generalization  of  the  rollout  approach,  we  consider  a  finite  set  of  multiple  heuristic  policies. 
The  minimizing  player  seeks  to  combine  dynamically  the  given  heuristic  policies  in  the  set  to  adapt 
to  the  different  trajectories  of  the  system  to  improve  the  performance  of  all  policies  in  the  set  under 
the  assumption  that  the  maximizing  player  plays  a  fixed  worst-case  policy  chosen  by  the  minimizer. 
As  in  the  rollout  algorithm  discussion,  we  first  study  the  discounted  cost  case  and  then  the  average 
cost  case. 

As we mentioned before, if we fix the maximizer's policy, the resulting game becomes a Markov decision process to the minimizer. Consider a finite set $\Lambda \subset \Pi$. It has been shown in [13] that if we define a new policy $\pi_{pr}$ such that $T_{\pi_{pr},\phi}(\min_{\pi\in\Lambda} V_\infty(\pi,\phi))(x) = T_\phi(\min_{\pi\in\Lambda} V_\infty(\pi,\phi))(x)$ for all $x \in X$, where min is defined componentwise on $X$, the new policy $\pi_{pr}$ improves all of the policies in $\Lambda$ in terms of the infinite horizon discounted cost (to see this, we simply show that $T_{\pi_{pr},\phi}(\min_{\pi\in\Lambda} V_\infty(\pi,\phi))(x) \le \min_{\pi\in\Lambda} V_\infty(\pi,\phi)(x)$ for all $x \in X$). That is, for all $x \in X$, $V_\infty(\pi_{pr},\phi)(x) \le \min_{\pi\in\Lambda} V_\infty(\pi,\phi)(x)$. Therefore, $\pi_{pr}$ dominates any policy $\pi \in \Lambda$.

As we have done for the rollout approach, we define formally the $H$-horizon parallel rollout policy $\pi_{pr,H}$ with a finite set $\Lambda$ of base policies $\pi \in \Pi$ to be a policy such that $T_{\pi_{pr,H},\phi}(\min_{\pi\in\Lambda} V_{H-1}(\pi,\phi))(x) = T_\phi(\min_{\pi\in\Lambda} V_{H-1}(\pi,\phi))(x)$ for all $x \in X$.
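A minimal sketch of the parallel rollout construction is given below, under the same simplifying assumptions as a plain MDP view of the game: the maximizer's fixed policy is folded into hypothetical containers `cost[x][a]` and `trans[x][a]`, and the minimizer uses pure actions. Each base policy is evaluated over $H-1$ steps, the componentwise minimum is taken, and the minimizer acts greedily with respect to it.

```python
def evaluate(policy, cost, trans, gamma, horizon, v0):
    """Cost of following `policy` for `horizon` steps with terminal value v0."""
    v = list(v0)
    n = len(cost)
    for _ in range(horizon):
        v = [cost[x][policy[x]]
             + gamma * sum(p * v[y] for y, p in enumerate(trans[x][policy[x]]))
             for x in range(n)]
    return v


def parallel_rollout_policy(bases, cost, trans, gamma, H, v0):
    """Minimizing action w.r.t. the componentwise minimum of the base values."""
    values = [evaluate(b, cost, trans, gamma, H - 1, v0) for b in bases]
    v_min = [min(v[x] for v in values) for x in range(len(cost))]

    def q(x, a):
        return cost[x][a] + gamma * sum(p * v_min[y] for y, p in enumerate(trans[x][a]))

    return [min(range(len(cost[x])), key=lambda a: q(x, a))
            for x in range(len(cost))]


# Base policy [0, 0] is good in state 0 only; base policy [1, 1] in state 1 only.
cost = [[1.0, 4.0], [4.0, 1.0]]
trans = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
bases = [[0, 0], [1, 1]]
pr = parallel_rollout_policy(bases, cost, trans, gamma=0.9, H=5, v0=[40.0, 40.0])
print(pr, evaluate(pr, cost, trans, 0.9, 1000, [0.0, 0.0]))
```

In this toy model neither base policy is good everywhere, yet the combined policy matches the better of the two in every state.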

We now give the main result regarding the $H$-horizon parallel rollout policy. It states that the parallel rollout policy dominates any policy in $\Lambda$ up to a small error, which is determined by the receding horizon size.

Proposition 4.3 Let $\Lambda \subset \Pi$ be a nonempty finite set of stationary policies. Given a fixed policy $\phi \in \Phi$ for the maximizer, suppose for each $\pi \in \Lambda$, $V_0(\pi,\phi)$ is selected such that for all $x \in X$, $T_{\pi,\phi}(V_0(\pi,\phi))(x) \le V_0(\pi,\phi)(x)$. For $\pi_{pr,H}$ defined on $\Lambda$ and played by the minimizer, given any $\epsilon > 0$, if $H \ge 1 + \log_\gamma \frac{\epsilon(1-\gamma)}{C_{\max}}$, then for all $x \in X$, $V_\infty(\pi_{pr,H},\phi)(x) \le \min_{\pi\in\Lambda} V_\infty(\pi,\phi)(x) + \epsilon$.

Proof: The idea of the proof is similar to that of Proposition 4.1. We define $\psi(x) = \min_{\pi\in\Lambda} V_{H-1}(\pi,\phi)(x)$ for all $x \in X$. Then

$$\begin{aligned}
T_{\pi_{pr,H},\phi}(\psi)(x) = T_\phi(\psi)(x) &= C_x(\pi_{pr,H}(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi_{pr,H}(x),\phi(x))\,\psi(y) \\
&\le C_x(\pi(x),\phi(x)) + \gamma\sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,V_{H-1}(\pi,\phi)(y) && \text{for any } \pi \in \Lambda \text{, from the definition of } \pi_{pr,H} \\
&= V_H(\pi,\phi)(x) \le V_{H-1}(\pi,\phi)(x) && \text{by the given assumption and Lemma 3.2.}
\end{aligned}$$

It follows that $T_{\pi_{pr,H},\phi}(\psi)(x) \le \psi(x)$ for all $x \in X$. Therefore, for all $x \in X$, we have $V_\infty(\pi_{pr,H},\phi)(x) \le \min_{\pi\in\Lambda} V_{H-1}(\pi,\phi)(x)$ by Lemma 3.1. We know that $V_\infty(\pi_{pr,H},\phi)(x) \le \min_{\pi\in\Lambda} V_\infty(\pi,\phi)(x) + \frac{C_{\max}}{1-\gamma}\cdot\gamma^{H-1}$ (cf. Proposition 4.1). Letting $\frac{C_{\max}}{1-\gamma}\cdot\gamma^{H-1} \le \epsilon$ yields the desired result. ■

For the average cost case, the definition of the parallel rollout policy in the discounted case is replaced with the "$\bar T$"-operator and $\bar V_{H-1}$. That is, the $H$-horizon parallel rollout policy $\pi_{pr,H}$ with a finite set $\Lambda$ of base policies $\pi \in \Pi$ with respect to a policy $\phi \in \Phi$ is defined as a policy such that $\bar T_{\pi_{pr,H},\phi}(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x) = \bar T_\phi(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x)$ for all $x \in X$.

We first analyze the performance of the $H$-horizon parallel rollout policy compared with those obtained by policies in $\Lambda$. For this purpose, for any $\pi \in \Pi$ and $\phi \in \Phi$, define $J^{\pi,\phi}_n(x) = \frac{1}{n}\bar V_n(\pi,\phi)(x)$ for all $x \in X$ and $n = 1, 2, \ldots$. That is, this is the $n$-horizon approximation of the value of the game for the average cost when the minimizer plays $\pi$ and the maximizer plays $\phi$. With arguments similar to Platzman's in Section 3.3 of [41], we can show that $J^{\pi,\phi}_n(x)$ converges, uniformly in $x$, as $O(n^{-1})$, to $J_\infty(\pi,\phi)$, $n = 1, 2, \ldots$.


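Theorem 4.1 Assume that Assumption 2.1 holds. Consider the $H$-horizon parallel rollout policy $\pi_{pr,H}$ with a finite set $\Lambda \subset \Pi$ with respect to a policy $\phi \in \Phi$. Then

$$J_\infty(\pi_{pr,H},\phi) \le \sum_{x} J_\infty\big(\arg\min_{\pi\in\Lambda} J^{\pi,\phi}_{H-1}(x),\ \phi\big)\,P^{\pi_{pr,H},\phi}(x) + 2\eta^{\frac{H-1}{s-1}} C_{\max}.$$

Proof: We first observe that

$$\begin{aligned}
\bar T_\phi(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x) &= \bar T_{\pi_{pr,H},\phi}(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x) && \text{by definition of } \pi_{pr,H} \\
&= C_x(\pi_{pr,H}(x),\phi(x)) + \sum_{y\in X} P_{xy}(\pi_{pr,H}(x),\phi(x)) \min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi)(y) \\
&\le C_x(\pi(x),\phi(x)) + \sum_{y\in X} P_{xy}(\pi(x),\phi(x))\,\bar V_{H-1}(\pi,\phi)(y) && \text{for any } \pi \in \Lambda \\
&= \bar V_H(\pi,\phi)(x).
\end{aligned}$$

Therefore, for all $x \in X$, $\bar T_\phi(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x) \le \min_{\pi\in\Lambda} \bar V_H(\pi,\phi)(x)$. Now,

$$\begin{aligned}
J_\infty(\pi_{pr,H},\phi) &= \sum_{x}\big[\bar T_\phi(\min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi))(x) - \min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi)(x)\big]\,P^{\pi_{pr,H},\phi}(x) && \text{by Lemma 4.2} \\
&\le \sum_{x}\big[\min_{\pi\in\Lambda} \bar V_H(\pi,\phi)(x) - \min_{\pi\in\Lambda} \bar V_{H-1}(\pi,\phi)(x)\big]\,P^{\pi_{pr,H},\phi}(x) \\
&\le \sum_{x}\big[\bar V_H(\arg\min_{\pi\in\Lambda} J^{\pi,\phi}_{H-1}(x),\phi)(x) - \bar V_{H-1}(\arg\min_{\pi\in\Lambda} J^{\pi,\phi}_{H-1}(x),\phi)(x)\big]\,P^{\pi_{pr,H},\phi}(x) \\
&\le \sum_{x} J_\infty\big(\arg\min_{\pi\in\Lambda} J^{\pi,\phi}_{H-1}(x),\ \phi\big)\,P^{\pi_{pr,H},\phi}(x) + 2\eta^{\frac{H-1}{s-1}} C_{\max} && \text{by Lemma 4.3.}
\end{aligned}$$

■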

From the result given in the above theorem, we can now discuss the convergence rate of the $H$-horizon parallel rollout policy. The second error term will approach zero geometrically in $\eta$ as $H \to \infty$, and $\arg\min_{\pi\in\Lambda} J^{\pi,\phi}_{H-1}(x)$ will approach the policy $\arg\min_{\pi\in\Lambda} J_\infty(\pi,\phi)$ at rate $O(H^{-1})$. In the limit, the parallel rollout policy will improve all policies in $\Lambda$. We remark that if for each $\pi \in \Lambda$, $\bar V_0(\pi,\phi)$ is selected such that for all $x \in X$, $\bar T_{\pi,\phi}(\bar V_0(\pi,\phi))(x) \le \bar V_0(\pi,\phi)(x)$, then we can write the result of the above theorem as follows:

$$J_\infty(\pi_{pr,H},\phi) \le \min_{\pi\in\Lambda} J_\infty(\pi,\phi) + 2\eta^{\frac{H-1}{s-1}} C_{\max}.$$

We conclude the discussion of the (parallel) rollout with a remark on the minimizer’s guess of the maximizer’s play. The parallel rollout approach above naturally gives the minimizer a way of guessing a worst-case scenario for the maximizer. Suppose the minimizer can guess the best response of the maximizer when he plays a given heuristic policy $\pi \in \Lambda$. In this case, the minimizer considers a finite set $\Omega \subset \Phi$ of multiple heuristic policies for the maximizer, defines a policy $\phi_{\max}(x) = \arg\max_{\phi\in\Omega}[\min_{\pi\in\Lambda} V_\infty(\pi,\phi)](x)$ for all $x \in X$, and uses the policy $\phi_{\max}$ as the fixed policy for the maximizer.
4.3  Hindsight  optimization

The recently proposed hindsight optimization approach [13] to solving Markov decision processes can also be extended to Markov games if we fix a policy for the maximizer. Under the assumption that the opponent (the maximizer) plays his best policy (as chosen by the minimizer), the hindsight-optimizing minimizer plays the game at each state based on his analysis of the expected optimal “retroactive” performance.

Given a policy $\phi \in \Phi$, define a function $p_{n,\phi} \in B(X)$ such that

$$
p_{n,\phi}(x) = E\Big[\min_{g_t \in G(x_t)\ \text{for all } t}\ \sum_{t=0}^{n-1} \gamma^t C_{x_t}(g_t,\phi(x_t)) \,\Big|\, x_0 = x\Big] \qquad (10)
$$

and call this the “hindsight optimal” value of state $x$, because it is the (expected) value of taking the (randomized) actions that the minimizer would wish to have taken once he has seen the particular random trace of the game. For the average cost case, we simply set $\gamma = 1$ and refer to the value as $p_{n,\phi}(x)$.

Given a policy $\phi$ for the maximizer, we formally define the $H$-horizon hindsight optimization policy as a policy $\pi_{ho,H}$ such that for all $x \in X$, $T_{\pi_{ho,H},\phi}(p_{H-1,\phi})(x) = T_{\phi}(p_{H-1,\phi})(x)$. The average cost case is defined with the “T”-operator applied to $p_{H-1,\phi}$. Because the minimization over the sequence of randomized actions is inside the expectation in Equation 10, this corresponds to solving the sample-path problem, which is deterministic. The hindsight optimal value of state $x$ is a lower bound on the equilibrium value if we set $\phi = \phi^*$, because, by Jensen’s inequality, $p_{n,\phi}(x) \le V_n(\pi,\phi)(x)$ for any $\pi \in \Pi$ in the discounted case, and likewise for any $\pi \in \Pi$ in the average cost case.

It is quite difficult to bound the hindsight optimal value without restrictive conditions on the game. However, we believe that studying this issue is important. For this purpose, we introduce an equivalent model description of Markov games. From the transition function $P_{xy}$ we can derive a function called the next state function $P : X \times G(X) \times F(X) \times [0,1] \to X$. In other words, given a policy pair $\pi$ and $\phi$ and the current state $x$, a random number $w$ selected uniformly from $[0,1]$ is mapped to a next state $y \in X$ in such a way that $y$ is selected with probability $P_{xy}(\pi(x),\phi(x))$. That is, $x_{t+1} = P(x_t, \pi(x_t), \phi(x_t), w_t)$ with the so-called random disturbance $w_t \in [0,1]$. The average payoff function $C$ is also newly defined, as a function of the disturbance $w$, such that $C_x(\pi(x),\phi(x)) = E_w[C_x(\pi(x),\phi(x),w)]$. See the definition of MDPs in Bertsekas’ book [8] or Ng’s deterministic (partially observable) MDP model [36] for a related construction.
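One common way to realize such a next state function is to invert the cumulative distribution of the transition probabilities. The Python sketch below illustrates this for a finite state space; it is an assumption-laden illustration rather than the construction used in [8] or [36], and the names `next_state`, `sample_next_state` and `trans_row` (the row of probabilities $P_{xy}(\pi(x),\phi(x))$ over $y$) are hypothetical.

```python
import random


def next_state(trans_row, w):
    """Map a disturbance w in [0, 1] to a next state by inverting the CDF.

    trans_row[y] is the probability of moving to state y from the current
    state under the current pair of actions.
    """
    cumulative = 0.0
    for y, p in enumerate(trans_row):
        cumulative += p
        if w < cumulative:
            return y
    # Guard against rounding when w is at (or numerically above) the total mass.
    return max(y for y, p in enumerate(trans_row) if p > 0)


def sample_next_state(trans_row, rng=random):
    """Draw w uniformly from [0, 1) and apply the next state function."""
    return next_state(trans_row, rng.random())
```

With $w$ uniform on $[0,1]$, state $y$ is returned with probability `trans_row[y]`, so the randomness of the game is carried entirely by the disturbance sequence.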

Now we define a function $Q$ such that

$$
Q(x_0,\pi_0,\ldots,\pi_{n-1},w_0,\ldots,w_{n-1}) = \sum_{t=0}^{n-1} \gamma^t C_{x_t}(\pi_t(x_t),\phi(x_t),w_t) \qquad (11)
$$

and for convenience, we will abbreviate this to $Q(x_0,\pi,w)$, where $w = \langle w_0,\ldots,w_{n-1}\rangle \in [0,1]^n$ and $\pi = \{\pi_0,\ldots,\pi_{n-1}\}$. Then,

$$
p_{n,\phi}(x_0) = E_w\big[\min_{\pi} Q(x_0,\pi,w)\big]
$$

because the minimization over nonstationary policies is equivalent to the minimization over the (randomized) action sequences given $w$.
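The identity above suggests a direct, if expensive, way to estimate the hindsight optimal value: sample disturbance traces and solve the deterministic problem on each trace. The following Python sketch does this by brute force. It rests on several assumptions that go beyond the text: the action set is finite and the same in every state, only pure actions are enumerated, the maximizer's fixed policy is folded into a hypothetical simulator `step(x, a, w)` that returns the pair (cost, next state), and `trace_cost` and `hindsight_value` are illustrative names.

```python
import itertools
import random


def trace_cost(x0, actions, w, step, gamma=1.0):
    """Cost Q(x0, actions, w) of a fixed action sequence on one disturbance trace."""
    x, total = x0, 0.0
    for t, (a, wt) in enumerate(zip(actions, w)):
        c, x = step(x, a, wt)  # hypothetical simulator: (cost, next state)
        total += gamma ** t * c
    return total


def hindsight_value(x0, n, action_set, step, num_traces=1000, gamma=1.0, seed=0):
    """Monte Carlo estimate of E_w[min over action sequences of Q(x0, ., w)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_traces):
        w = [rng.random() for _ in range(n)]
        # With w fixed the problem is deterministic; enumerate all sequences.
        total += min(trace_cost(x0, seq, w, step, gamma)
                     for seq in itertools.product(action_set, repeat=n))
    return total / num_traces
```

The enumeration grows exponentially in $n$, so in practice it would be replaced by a problem-specific solver for the deterministic sample-path problem.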

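Proposition 4.4 Suppose $C_x(g,f,w)$ is convex as a function of $g$ and $w$ jointly for every fixed $x \in X$ and $f \in F(x)$ and Vf is convex as a function of $x$. Then, for all $x \in X$,

$$
0 \le \inf_{\pi\in\Pi} V_n(\pi,\phi)(x) - p_{n,\phi}(x) \le E_w[Q(x,\pi_{0.5},w)] - Q(x,\pi_{0.5},\mathbf{0.5})
$$

where $\mathbf{0.5}$ is a vector of size $n$ with every entry 0.5 and $\pi_{0.5}$ solves $\inf_{\pi\in\Pi} Q(x,\pi,\mathbf{0.5})$.

Proof: Under our assumptions, the function $Q$ is jointly convex in $w$ and $\pi$, which range over $[0,1]^n$ and a Cartesian product of polyhedral sets, respectively, and the product of these two spaces is a convex set. Since each $w_t$ is uniform on $[0,1]$, the mean of $w$ is $\mathbf{0.5}$, so Jensen’s inequality gives $p_{n,\phi}(x) = E_w[\min_{\pi} Q(x,\pi,w)] \ge Q(x,\pi_{0.5},\mathbf{0.5})$, while $\inf_{\pi\in\Pi} V_n(\pi,\phi)(x) \le E_w[Q(x,\pi_{0.5},w)]$. This is a direct application of Avriel and Williams’ theorem, which uses Jensen’s inequality to bound the expected value of perfect information [22]. ■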

The same result holds for the average cost case (with $\gamma = 1$). In particular, if $\phi = \phi^*$, the proposition above bounds the gap between the hindsight optimal value and the $n$-horizon equilibrium value.
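A small numerical check can make the two sides of such a bound concrete. The Python sketch below uses a toy one-step cost $Q(a,w) = (a-w)^2$, which is jointly convex, with a scalar decision $a \in [0,1]$ in place of a policy; this toy problem, the grids, and the name `evpi_bounds` are assumptions for illustration only. It computes the gap between the best decision made before $w$ is known and the hindsight value, together with the Jensen-type upper bound, written here as the expected cost of the mean-disturbance decision minus its cost at the mean disturbance.

```python
def evpi_bounds(Q, decisions, disturbances):
    """Return (gap, upper) for a cost Q(a, w) with equally likely disturbances.

    gap   = min_a E_w[Q(a, w)] - E_w[min_a Q(a, w)]
    upper = E_w[Q(a_bar, w)] - Q(a_bar, w_bar), where a_bar minimizes Q(., w_bar)
    """
    m = len(disturbances)

    def mean_cost(a):
        return sum(Q(a, w) for w in disturbances) / m

    w_bar = sum(disturbances) / m
    here_and_now = min(mean_cost(a) for a in decisions)
    hindsight = sum(min(Q(a, w) for a in decisions) for w in disturbances) / m
    a_bar = min(decisions, key=lambda a: Q(a, w_bar))
    return here_and_now - hindsight, mean_cost(a_bar) - Q(a_bar, w_bar)


decisions = [i / 100 for i in range(101)]              # grid over [0, 1]
disturbances = [(k + 0.5) / 200 for k in range(200)]   # midpoint grid on [0, 1]
gap, upper = evpi_bounds(lambda a, w: (a - w) ** 2, decisions, disturbances)
```

For this toy cost the gap is nonnegative and sits just below the upper bound, and both are close to $1/12$, the variance of a uniform disturbance on $[0,1]$.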

We remark that the hindsight-optimization-based approach rests on the game-theoretic framework, which distinguishes it from the simulation-based approach used in the computer bridge player GIB [20]. In the context of our discussion, the approach taken there can be viewed as follows: many sample paths are drawn; for each sample path, the optimal solution with respect to that sample path is analyzed after taking each deterministic candidate action; one counts the number of times that each deterministic action achieves the minimum cost sum; and a deterministic action is then selected by voting. It would be interesting to compare the two approaches in practical applications.

5  Implementation  and  Research  Directions 

In this section, we briefly discuss how the (approximate) receding horizon approaches discussed above can be implemented in practice, and we raise some issues and directions for future research.

Kearns et al. [28] present an algorithm that uses samples to estimate $V^*$ (the undiscounted finite horizon value of the game) within a given error bound, and that can be easily adapted to the discounted setting. They analyzed the number of samples necessary to obtain a desired accuracy. The per-state running time of their algorithm is independent of the state space size but exponential in the horizon size. Note that the computational complexity of finite horizon value iteration depends on the state space size, even though it depends only linearly on the horizon size, so that applying it to a game with a very large state space is difficult.
The exponential dependence on the horizon size can be alleviated by using the three heuristic approaches we discussed. We can simply use Monte Carlo simulation to estimate the relevant function values. For example, the minimizer who uses the $H$-horizon rollout policy simulates the given heuristic base policy and the fixed policy for the maximizer using sampling over a finite horizon $H - 1$, and the results of the simulation are used to “select” the (apparently) best current randomized action at the current state. We assume that there is a selection function available that extracts the randomized action that achieves the infimum/supremum. The randomized action selected is the one with the highest “utility” at the current state, as estimated by sampling. Of course, we can use various sampling techniques (see, e.g., [33]), such as importance sampling, to improve the estimation procedure. Therefore, the rollout/parallel rollout approach is practically viable. On the other hand, the hindsight optimization approach needs a fast hindsight problem solver.
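The sampling-based rollout selection described above can be sketched as follows in Python. This is a minimal illustration under assumptions not made in the text: the candidate actions form a finite set of pure actions (so the selection function is a simple minimum over estimates), costs are undiscounted, and the simulator `step(x, a, b, w)` returning (cost, next state), the policy callables `base_policy` and `max_policy`, and the function names are all hypothetical.

```python
import random


def simulate_cost(x, horizon, base_policy, max_policy, step, rng):
    """Total cost of one sampled run in which both players follow fixed policies."""
    total = 0.0
    for _ in range(horizon):
        c, x = step(x, base_policy(x), max_policy(x), rng.random())
        total += c
    return total


def rollout_action(x, H, actions, base_policy, max_policy, step,
                   num_samples=100, seed=0):
    """Pick the candidate action with the lowest sampled H-horizon cost at x."""
    rng = random.Random(seed)
    b = max_policy(x)  # action of the fixed maximizer policy at x

    def estimate(a):
        total = 0.0
        for _ in range(num_samples):
            # Take candidate action a now, then follow the base policy for H-1 steps.
            c, y = step(x, a, b, rng.random())
            total += c + simulate_cost(y, H - 1, base_policy, max_policy, step, rng)
        return total / num_samples

    return min(actions, key=estimate)
```

The cost of one decision is proportional to the number of candidate actions, the number of samples and the horizon, and does not depend on the size of the state space.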

Extending the receding horizon framework to the $N$-person ($N \ge 3$) case and analyzing the performance will be difficult because, to the authors’ knowledge, no iteration algorithm based on a contraction mapping is available. However, each player can heuristically use the rollout/parallel rollout and the hindsight optimization for his policy choice.

We  can  also  consider  applying  the  three  heuristics  to  nonzero-sum  stochastic  games.  Analyzing 
the  structure  of  equilibrium  policies,  in  this  case,  is  often  more  difficult  than  for  zero-sum  games. 
For  zero-sum  games,  a  standard  technique,  e.g.,  value  iteration,  can  be  used  (see,  e.g.,  [1,  3]  and 
references  therein).  However,  for  nonzero-sum  games,  we  need  to  use  a  different  non-standard 
technique  (see,  e.g.,  [2])  to  analyze  the  structure,  which  is  quite  cumbersome. 

Finally, we can incorporate the idea of neuro-dynamic programming (NDP) [10] into the approximate receding horizon control framework. That is, the feature-based approximations in NDP can be applied when we estimate the value of the underlying subgame, although how to extract good features is a difficult problem in general.